# Knowledge Distillation for EfficientNetV2 on CIFAR-100 - Version 3.1 (Tuned DKD)

This notebook implements **State-of-the-Art** knowledge distillation with:

- **Decoupled Knowledge Distillation (DKD)** - Separates target and non-target class knowledge (Zhao et al., CVPR 2022)
- **AutoAugment + Random Erasing** - Advanced data augmentation
- **CutMix + Mixup** - Sample mixing strategies
- **Label Smoothing + LR Warmup** - Training stability

---

## Version History:

- **v1 (Baseline):** Standard KD + CutMix/Mixup → 72.34% accuracy
- **v2 (Enhanced):** + AutoAugment + Label Smoothing + LR Warmup → 74.20% accuracy
- **v3 (DKD):** + Decoupled Knowledge Distillation (β=8.0) → 72.69% (over-regularized)
- **v3.1 (Tuned DKD):** Reduced β from 8.0 to 2.0 → Target: 75%+ accuracy

---

## Key Changes in v3.1:

1. **Reduced DKD_BETA** from 8.0 to 2.0 (less aggressive NCKD weight)
2. **Reuse Teacher v3** (76.02% accuracy) - no retraining needed
3. **New student model name:** `student_v3_1_dkd_beta2`

**Diagnosis:** v3 suffered from over-regularization due to high β combined with strong augmentation (Mixup/CutMix).

---


In [1]:
# Cell 1: Setup and Imports
import os
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
from torchvision.models import efficientnet_v2_s, efficientnet_v2_l, EfficientNet_V2_S_Weights, EfficientNet_V2_L_Weights
from tqdm import tqdm
import copy
import numpy as np
import matplotlib.pyplot as plt
import glob
import random
from pathlib import Path

# Setup Directories (Local paths)
PROJECT_ROOT = Path('./outputs')
MODEL_DIR = PROJECT_ROOT / 'models'
DATA_DIR = PROJECT_ROOT / 'data'
CHECKPOINT_DIR = PROJECT_ROOT / 'checkpoints'

MODEL_DIR.mkdir(parents=True, exist_ok=True)
DATA_DIR.mkdir(parents=True, exist_ok=True)
CHECKPOINT_DIR.mkdir(parents=True, exist_ok=True)

print(f"Directories ready at: {PROJECT_ROOT.absolute()}")

# Reproducibility
SEED = 42
torch.manual_seed(SEED)
np.random.seed(SEED)
random.seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed(SEED)
    torch.backends.cudnn.benchmark = True

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")

Directories ready at: d:\Projects\MasterProject\code\outputs
Using device: cuda
GPU: NVIDIA GeForce RTX 5070 Laptop GPU


In [2]:
# Cell 2: Experiment Configuration (Version 3.1 - Tuned DKD Beta)
# ==========================================
# HYPERPARAMETERS
# ==========================================
NUM_EPOCHS = 200
TEACHER_EPOCHS = 300        # Extended teacher training for v3
BATCH_SIZE = 128
LEARNING_RATE = 0.001
WEIGHT_DECAY = 0.05
PATIENCE = 30
WARMUP_EPOCHS = 5           # LR warmup

# Distillation Params (DKD - Tuned for v3.1)
DKD_ALPHA = 1.0             # Weight for Target Class Knowledge Distillation (TCKD)
DKD_BETA = 2.0              # REDUCED from 8.0 to 2.0 - Less aggressive NCKD
TEMPERATURE = 4.0           # Softmax Temperature
LABEL_SMOOTHING = 0.1       # Label smoothing for hard loss

# Augmentation Params
MIXUP_ALPHA = 0.8
CUTMIX_ALPHA = 1.0
CHECKPOINT_FREQUENCY = 20
NUM_CLASSES = 100

# Version 3.1 Model Names
TEACHER_NAME = "teacher_v3"              # Reuse the strong teacher (76.02%)
STUDENT_NAME = "student_v3_1_dkd_beta2"  # New student with tuned beta

print(f"{'='*60}")
print(f"VERSION 3.1 CONFIG - Tuned Beta Parameter")
print(f"{'='*60}")
print(f"Teacher: {TEACHER_NAME} (Accuracy: 76.02%)")
print(f"Training: Epochs={NUM_EPOCHS} | Batch={BATCH_SIZE} | LR={LEARNING_RATE}")
print(f"DKD: Alpha(TCKD)={DKD_ALPHA}, Beta(NCKD)={DKD_BETA} (REDUCED from 8.0)")
print(f"Augmentation: AutoAugment + RandomErasing + CutMix/Mixup")
print(f"Student Model: {STUDENT_NAME}")
print(f"{'='*60}")

VERSION 3.1 CONFIG - Tuned Beta Parameter
Teacher: teacher_v3 (Accuracy: 76.02%)
Training: Epochs=200 | Batch=128 | LR=0.001
DKD: Alpha(TCKD)=1.0, Beta(NCKD)=2.0 (REDUCED from 8.0)
Augmentation: AutoAugment + RandomErasing + CutMix/Mixup
Student Model: student_v3_1_dkd_beta2


In [3]:
# Cell 3: Data Loading (Enhanced with AutoAugment + RandomErasing)
from torchvision.transforms import autoaugment

transform_train = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    autoaugment.AutoAugment(policy=autoaugment.AutoAugmentPolicy.CIFAR10),
    transforms.ToTensor(),
    transforms.Normalize((0.5071, 0.4867, 0.4408), (0.2675, 0.2565, 0.2761)),
    transforms.RandomErasing(p=0.25, scale=(0.02, 0.2)),  # Cutout-like augmentation
])

transform_test = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5071, 0.4867, 0.4408), (0.2675, 0.2565, 0.2761)),
])

trainset = torchvision.datasets.CIFAR100(root=str(DATA_DIR), train=True, download=True, transform=transform_train)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=BATCH_SIZE, shuffle=True, 
                                          num_workers=4, pin_memory=True, drop_last=True)

testset = torchvision.datasets.CIFAR100(root=str(DATA_DIR), train=False, download=True, transform=transform_test)
testloader = torch.utils.data.DataLoader(testset, batch_size=BATCH_SIZE, shuffle=False, 
                                         num_workers=4, pin_memory=True)

print(f"Data loaded: {len(trainset)} Training, {len(testset)} Test images")
print(f"  Augmentation: AutoAugment + RandomErasing + CutMix/Mixup")

Data loaded: 50000 Training, 10000 Test images
  Augmentation: AutoAugment + RandomErasing + CutMix/Mixup


In [4]:
# Cell 4: Helper Functions (DKD Loss + Augmentations + Utilities)

# ==========================================
# 1. Decoupled Knowledge Distillation (DKD) Loss
# Based on: Zhao et al. "Decoupled Knowledge Distillation", CVPR 2022
# https://openaccess.thecvf.com/content/CVPR2022/papers/Zhao_Decoupled_Knowledge_Distillation_CVPR_2022_paper.pdf
# ==========================================

def _get_gt_mask(logits, target):
    """Get mask for target (ground truth) class"""
    target = target.reshape(-1)
    mask = torch.zeros_like(logits).scatter_(1, target.unsqueeze(1), 1).bool()
    return mask

def _get_other_mask(logits, target):
    """Get mask for non-target classes"""
    target = target.reshape(-1)
    mask = torch.ones_like(logits).scatter_(1, target.unsqueeze(1), 0).bool()
    return mask

def cat_mask(t, mask1, mask2):
    """Concatenate masked values"""
    t1 = (t * mask1).sum(dim=1, keepdims=True)
    t2 = (t * mask2).sum(1, keepdims=True)
    rt = torch.cat([t1, t2], dim=1)
    return rt

def dkd_loss(student_logits, teacher_logits, target, alpha=1.0, beta=8.0, temperature=4.0):
    """
    Decoupled Knowledge Distillation (DKD) Loss.
    
    Separates knowledge into:
    - TCKD (Target Class Knowledge Distillation): Knowledge about the correct class
    - NCKD (Non-Target Class Knowledge Distillation): Knowledge about class relationships
    
    Args:
        student_logits: Student model outputs
        teacher_logits: Teacher model outputs
        target: Ground truth labels
        alpha: Weight for TCKD (default: 1.0)
        beta: Weight for NCKD (default: 8.0, crucial for performance)
        temperature: Softmax temperature (default: 4.0)
    
    Returns:
        Combined DKD loss
    """
    # Get masks for target and non-target classes
    gt_mask = _get_gt_mask(student_logits, target)
    other_mask = _get_other_mask(student_logits, target)
    
    # Calculate probabilities with temperature scaling
    pred_student = F.softmax(student_logits / temperature, dim=1)
    pred_teacher = F.softmax(teacher_logits / temperature, dim=1)
    
    # Separate target and non-target probabilities
    pred_student_cat = cat_mask(pred_student, gt_mask, other_mask)
    pred_teacher_cat = cat_mask(pred_teacher, gt_mask, other_mask)
    
    # Target Class Knowledge Distillation (TCKD)
    log_pred_student = torch.log(pred_student_cat + 1e-8)
    tckd_loss = (
        F.kl_div(log_pred_student, pred_teacher_cat, reduction='batchmean')
        * (temperature ** 2)
    )
    
    # Non-Target Class Knowledge Distillation (NCKD)
    # Mask out target class with large negative value
    pred_teacher_part2 = F.softmax(
        teacher_logits / temperature - 1000.0 * gt_mask, dim=1
    )
    log_pred_student_part2 = F.log_softmax(
        student_logits / temperature - 1000.0 * gt_mask, dim=1
    )
    
    nckd_loss = (
        F.kl_div(log_pred_student_part2, pred_teacher_part2, reduction='batchmean')
        * (temperature ** 2)
    )
    
    # Combined DKD loss
    return alpha * tckd_loss + beta * nckd_loss

def dkd_loss_mixup(student_logits, teacher_logits, labels_a, labels_b, lam,
                   alpha=1.0, beta=8.0, temperature=4.0, label_smoothing=0.1):
    """
    DKD Loss for mixed samples (CutMix/Mixup compatible)
    Uses weighted combination of DKD losses for both label sets
    """
    # DKD for both label sets
    dkd_a = dkd_loss(student_logits, teacher_logits, labels_a, alpha, beta, temperature)
    dkd_b = dkd_loss(student_logits, teacher_logits, labels_b, alpha, beta, temperature)
    
    # Weighted DKD loss
    soft_loss = lam * dkd_a + (1 - lam) * dkd_b
    
    # Hard loss with label smoothing
    hard_loss = lam * F.cross_entropy(student_logits, labels_a, label_smoothing=label_smoothing) + \
                (1 - lam) * F.cross_entropy(student_logits, labels_b, label_smoothing=label_smoothing)
    
    # Combined loss (DKD already weighted by alpha/beta, so we just add hard loss)
    # Using 0.1 weight for hard loss to not overpower DKD
    loss = soft_loss + 0.1 * hard_loss
    
    # NaN safety check
    if torch.isnan(loss):
        return hard_loss
    
    return loss

# ==========================================
# 2. Standard KD Loss (kept for comparison)
# ==========================================
def kd_loss(student_logits, teacher_logits, labels, temp=4.0, alpha=0.7, label_smoothing=0.1):
    """Standard Knowledge Distillation Loss (for comparison)"""
    soft_student = F.log_softmax(student_logits / temp, dim=1)
    soft_teacher = F.softmax(teacher_logits / temp, dim=1)
    soft_loss = F.kl_div(soft_student, soft_teacher, reduction='batchmean') * (temp ** 2)
    hard_loss = F.cross_entropy(student_logits, labels, label_smoothing=label_smoothing)
    return alpha * soft_loss + (1 - alpha) * hard_loss

# ==========================================
# 3. Augmentations: Mixup & CutMix
# ==========================================
def mixup_data(x, y, alpha=1.0):
    if alpha > 0:
        lam = np.random.beta(alpha, alpha)
    else:
        lam = 1
    batch_size = x.size()[0]
    index = torch.randperm(batch_size).to(device)
    mixed_x = lam * x + (1 - lam) * x[index, :]
    y_a, y_b = y, y[index]
    return mixed_x, y_a, y_b, lam

def rand_bbox(size, lam):
    W = size[2]
    H = size[3]
    cut_rat = np.sqrt(1. - lam)
    cut_w = int(W * cut_rat)
    cut_h = int(H * cut_rat)
    cx = np.random.randint(W)
    cy = np.random.randint(H)
    bbx1 = np.clip(cx - cut_w // 2, 0, W)
    bby1 = np.clip(cy - cut_h // 2, 0, H)
    bbx2 = np.clip(cx + cut_w // 2, 0, W)
    bby2 = np.clip(cy + cut_h // 2, 0, H)
    return bbx1, bby1, bbx2, bby2

def cutmix_data(x, y, alpha=1.0):
    if alpha > 0:
        lam = np.random.beta(alpha, alpha)
    else:
        lam = 1
    batch_size = x.size()[0]
    index = torch.randperm(batch_size).to(device)
    bbx1, bby1, bbx2, bby2 = rand_bbox(x.size(), lam)
    x[:, :, bbx1:bbx2, bby1:bby2] = x[index, :, bbx1:bbx2, bby1:bby2]
    lam = 1 - ((bbx2 - bbx1) * (bby2 - bby1) / (x.size()[-1] * x.size()[-2]))
    y_a, y_b = y, y[index]
    return x, y_a, y_b, lam

# ==========================================
# 4. Utilities (Save/Load/Evaluate)
# ==========================================
def evaluate_model_with_loss(model, dataloader, criterion):
    model.eval()
    correct = 0
    total = 0
    running_loss = 0.0
    with torch.no_grad():
        for data in dataloader:
            images, labels = data
            images, labels = images.to(device), labels.to(device)
            outputs = model(images)
            loss = criterion(outputs, labels)
            running_loss += loss.item()
            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
    acc = 100 * correct / total
    avg_loss = running_loss / len(dataloader)
    return acc, avg_loss

def save_checkpoint(model, optimizer, scheduler, epoch, best_acc, history, model_name, epochs_no_improve):
    checkpoint = {
        'epoch': epoch,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'scheduler_state_dict': scheduler.state_dict(),
        'best_acc': best_acc,
        'history': history,
        'epochs_no_improve': epochs_no_improve
    }
    path = CHECKPOINT_DIR / f"{model_name}_epoch{epoch+1}.pth"
    torch.save(checkpoint, path)
    print(f"  Checkpoint saved: {path}")
    
def load_checkpoint(model, optimizer, scheduler, model_name):
    checkpoints = sorted(glob.glob(str(CHECKPOINT_DIR / f"{model_name}_epoch*.pth")))
    if not checkpoints:
        return None
    latest = checkpoints[-1]
    print(f"  Loading checkpoint: {latest}")
    checkpoint = torch.load(latest, map_location=device)
    model.load_state_dict(checkpoint['model_state_dict'])
    optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
    scheduler.load_state_dict(checkpoint['scheduler_state_dict'])
    return checkpoint

def cleanup_old_checkpoints(model_name, keep=3):
    checkpoints = sorted(glob.glob(str(CHECKPOINT_DIR / f"{model_name}_epoch*.pth")))
    if len(checkpoints) > keep:
        for chk in checkpoints[:-keep]:
            os.remove(chk)
            print(f"  Cleaned up: {os.path.basename(chk)}")

print("Helper functions loaded:")
print("  - DKD Loss (Decoupled Knowledge Distillation)")
print("  - Standard KD Loss (for comparison)")
print("  - CutMix, Mixup augmentations")
print("  - Checkpoint utilities")

Helper functions loaded:
  - DKD Loss (Decoupled Knowledge Distillation)
  - Standard KD Loss (for comparison)
  - CutMix, Mixup augmentations
  - Checkpoint utilities


In [5]:
# Cell 5: Training Loop (Version 3 - with DKD support)
def train_model_v3(model, dataloader, optimizer, scheduler, num_epochs, model_name, 
                   teacher_model=None, use_dkd=True,
                   dkd_alpha=1.0, dkd_beta=8.0, temperature=4.0, label_smoothing=0.1,
                   patience=30, grad_clip=1.0, warmup_epochs=5):
    """
    Version 3 Training loop with:
    - Decoupled Knowledge Distillation (DKD) loss
    - CutMix/Mixup augmentation
    - Learning Rate Warmup
    - Mixed precision training
    - Gradient clipping
    - NaN detection and recovery
    """
    
    # 1. Load Checkpoint
    checkpoint = load_checkpoint(model, optimizer, scheduler, model_name)
    if checkpoint:
        start_epoch = checkpoint['epoch'] + 1
        best_acc = checkpoint['best_acc']
        history = checkpoint['history']
        epochs_no_improve = checkpoint['epochs_no_improve']
        best_model_wts = copy.deepcopy(model.state_dict())
        print(f"  Resuming from epoch {start_epoch}, Best Acc: {best_acc:.2f}%")
    else:
        start_epoch = 0
        best_acc = 0.0
        history = {'train_loss': [], 'val_loss': [], 'val_accuracy': []}
        epochs_no_improve = 0
        best_model_wts = copy.deepcopy(model.state_dict())
        print(f"  Starting fresh training...")

    # 2. Setup
    scaler = torch.amp.GradScaler('cuda', enabled=torch.cuda.is_available())
    val_criterion = nn.CrossEntropyLoss()
    
    # Store base LR for warmup
    base_lr = optimizer.param_groups[0]['lr']
    
    if teacher_model:
        teacher_model.eval()
        for param in teacher_model.parameters():
            param.requires_grad = False
        loss_type = "DKD" if use_dkd else "Standard KD"
        print(f"  Using {loss_type} loss with Teacher guidance")

    # 3. Training Loop
    for epoch in range(start_epoch, num_epochs):
        model.train()
        running_loss = 0.0
        valid_batches = 0
        
        # Learning Rate Warmup
        if epoch < warmup_epochs:
            warmup_lr = base_lr * (epoch + 1) / warmup_epochs
            for param_group in optimizer.param_groups:
                param_group['lr'] = warmup_lr
        
        loop = tqdm(dataloader, desc=f"Epoch {epoch+1}/{num_epochs}")
        
        for inputs, labels in loop:
            inputs, labels = inputs.to(device), labels.to(device)
            
            # Randomly choose Mixup (50%) or CutMix (50%)
            use_cutmix = np.random.rand() > 0.5
            if use_cutmix:
                inputs_aug, labels_a, labels_b, lam = cutmix_data(inputs.clone(), labels, alpha=CUTMIX_ALPHA)
            else:
                inputs_aug, labels_a, labels_b, lam = mixup_data(inputs, labels, alpha=MIXUP_ALPHA)
            
            optimizer.zero_grad()
            
            with torch.amp.autocast('cuda', enabled=torch.cuda.is_available()):
                # Student Forward
                student_outputs = model(inputs_aug)
                
                # Loss Calculation
                if teacher_model:
                    # Teacher Forward (no grad)
                    with torch.no_grad():
                        teacher_outputs = teacher_model(inputs_aug)
                    
                    if use_dkd:
                        # DKD Loss (Version 3)
                        loss = dkd_loss_mixup(
                            student_outputs, teacher_outputs, 
                            labels_a, labels_b, lam,
                            alpha=dkd_alpha, beta=dkd_beta, 
                            temperature=temperature, label_smoothing=label_smoothing
                        )
                    else:
                        # Standard KD Loss (for comparison)
                        loss = lam * kd_loss(student_outputs, teacher_outputs, labels_a, temperature, 0.7, label_smoothing) + \
                               (1 - lam) * kd_loss(student_outputs, teacher_outputs, labels_b, temperature, 0.7, label_smoothing)
                else:
                    # Standard CE with Label Smoothing for Teacher training
                    loss = lam * F.cross_entropy(student_outputs, labels_a, label_smoothing=label_smoothing) + \
                           (1 - lam) * F.cross_entropy(student_outputs, labels_b, label_smoothing=label_smoothing)
            
            # Skip batch if loss is NaN
            if torch.isnan(loss):
                loop.set_postfix(loss="NaN-skip")
                continue
            
            # Backward
            scaler.scale(loss).backward()
            
            if grad_clip > 0:
                scaler.unscale_(optimizer)
                nn.utils.clip_grad_norm_(model.parameters(), grad_clip)
                
            scaler.step(optimizer)
            scaler.update()
            
            running_loss += loss.item()
            valid_batches += 1
            loop.set_postfix(loss=f"{loss.item():.3f}")

        # Step Scheduler (only after warmup)
        if epoch >= warmup_epochs:
            scheduler.step()
        
        # Validation
        train_loss = running_loss / max(valid_batches, 1)
        val_acc, val_loss = evaluate_model_with_loss(model, testloader, val_criterion)
        
        history['train_loss'].append(train_loss)
        history['val_loss'].append(val_loss)
        history['val_accuracy'].append(val_acc)
        
        current_lr = optimizer.param_groups[0]['lr']
        print(f"Epoch {epoch+1}/{num_epochs} | Train Loss: {train_loss:.4f} | Val Loss: {val_loss:.4f} | Val Acc: {val_acc:.2f}% | LR: {current_lr:.6f}")
        
        # Save Best
        if val_acc > best_acc:
            best_acc = val_acc
            best_model_wts = copy.deepcopy(model.state_dict())
            torch.save(model.state_dict(), MODEL_DIR / f"{model_name}.pth")
            epochs_no_improve = 0
            print(f"  ★ New best model saved! Accuracy: {best_acc:.2f}%")
        else:
            epochs_no_improve += 1
            
        # Checkpointing
        if (epoch + 1) % CHECKPOINT_FREQUENCY == 0:
            save_checkpoint(model, optimizer, scheduler, epoch, best_acc, history, model_name, epochs_no_improve)
            cleanup_old_checkpoints(model_name)
            
        # Early Stopping
        if epochs_no_improve >= patience:
            print(f"Early stopping triggered after {epoch+1} epochs.")
            break
        
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
            
    print(f"\n{'='*50}")
    print(f"Training complete. Best accuracy: {best_acc:.2f}%")
    print(f"{'='*50}")
    model.load_state_dict(best_model_wts)
    return model, history

print("Training function loaded (Version 3: DKD + LR Warmup + Label Smoothing)")

Training function loaded (Version 3: DKD + LR Warmup + Label Smoothing)


In [6]:
# Cell 6: Initialize Models
print("Loading Teacher (EfficientNetV2-L)...")
teacher_model = efficientnet_v2_l(weights=EfficientNet_V2_L_Weights.IMAGENET1K_V1)
teacher_model.classifier[1] = nn.Linear(teacher_model.classifier[1].in_features, NUM_CLASSES)
teacher_model = teacher_model.to(device)

print("Loading Student (EfficientNetV2-S)...")
student_model = efficientnet_v2_s(weights=EfficientNet_V2_S_Weights.IMAGENET1K_V1)
student_model.classifier[1] = nn.Linear(student_model.classifier[1].in_features, NUM_CLASSES)
student_model = student_model.to(device)

print("Models loaded")

Loading Teacher (EfficientNetV2-L)...
Loading Student (EfficientNetV2-S)...
Models loaded


In [7]:
# Cell 7: Train/Load Teacher Model (Version 3)
print("\n" + "="*70)
print("TEACHER MODEL (Version 3)")
print("="*70)

teacher_path = MODEL_DIR / f"{TEACHER_NAME}.pth"

# Also check for existing v1/v2 teacher to use as starting point
existing_teacher = MODEL_DIR / "teacher_model.pth"

if teacher_path.exists():
    print(f"Found existing Teacher v3: {teacher_path}")
    teacher_model.load_state_dict(torch.load(teacher_path, map_location=device))
elif existing_teacher.exists():
    print(f"Found existing Teacher (v1/v2): {existing_teacher}")
    print(f"Loading and continuing training for {TEACHER_EPOCHS} epochs...")
    teacher_model.load_state_dict(torch.load(existing_teacher, map_location=device))
    
    # Continue training with extended epochs
    opt_t = optim.AdamW(teacher_model.parameters(), lr=LEARNING_RATE * 0.1, weight_decay=WEIGHT_DECAY)  # Lower LR for fine-tuning
    sch_t = optim.lr_scheduler.CosineAnnealingLR(opt_t, T_max=100)  # Additional 100 epochs
    
    teacher_model, teacher_history = train_model_v3(
        teacher_model, trainloader, opt_t, sch_t, 
        num_epochs=100,  # Additional epochs
        model_name=TEACHER_NAME, 
        teacher_model=None,
        warmup_epochs=0  # No warmup for fine-tuning
    )
else:
    print(f"Training Teacher Model v3 from scratch ({TEACHER_EPOCHS} epochs)...")
    print("This may take a while...")
    
    # Re-initialize teacher
    teacher_model = efficientnet_v2_l(weights=EfficientNet_V2_L_Weights.IMAGENET1K_V1)
    teacher_model.classifier[1] = nn.Linear(teacher_model.classifier[1].in_features, NUM_CLASSES)
    teacher_model = teacher_model.to(device)
    
    opt_t = optim.AdamW(teacher_model.parameters(), lr=LEARNING_RATE, weight_decay=WEIGHT_DECAY)
    sch_t = optim.lr_scheduler.CosineAnnealingLR(opt_t, T_max=TEACHER_EPOCHS - WARMUP_EPOCHS)
    
    teacher_model, teacher_history = train_model_v3(
        teacher_model, trainloader, opt_t, sch_t, 
        num_epochs=TEACHER_EPOCHS, 
        model_name=TEACHER_NAME, 
        teacher_model=None,
        warmup_epochs=WARMUP_EPOCHS
    )

# Evaluate Teacher
teacher_model.eval()
teacher_accuracy, _ = evaluate_model_with_loss(teacher_model, testloader, nn.CrossEntropyLoss())
print(f"\nTeacher v3 Accuracy: {teacher_accuracy:.2f}%")


TEACHER MODEL (Version 3)
Found existing Teacher v3: outputs\models\teacher_v3.pth

Teacher v3 Accuracy: 76.02%


In [8]:
# Cell 8: Train Student with DKD (Version 3.1 - Tuned Beta)
print("\n" + "="*70)
print("STUDENT MODEL with DKD (Version 3.1 - Tuned Beta)")
print("="*70)

student_path = MODEL_DIR / f"{STUDENT_NAME}.pth"

if student_path.exists():
    print(f"Found existing Student v3.1: {student_path}")
    student_model.load_state_dict(torch.load(student_path, map_location=device))
    distilled_accuracy, _ = evaluate_model_with_loss(student_model, testloader, nn.CrossEntropyLoss())
    print(f"Student v3.1 (DKD β=2.0) Accuracy: {distilled_accuracy:.2f}%")
else:
    print(f"\nStarting Decoupled Knowledge Distillation (DKD) v3.1...")
    print(f"  Key Change: DKD Beta REDUCED from 8.0 to {DKD_BETA}")
    print(f"  DKD Alpha (TCKD weight): {DKD_ALPHA}")
    print(f"  DKD Beta (NCKD weight): {DKD_BETA} ← REDUCED to fix over-regularization")
    print(f"  Temperature: {TEMPERATURE}")
    print(f"  Label Smoothing: {LABEL_SMOOTHING}")
    print(f"  LR Warmup: {WARMUP_EPOCHS} epochs")
    print(f"  Augmentation: AutoAugment + RandomErasing + CutMix/Mixup")
    
    # Re-initialize student model (fresh weights from ImageNet)
    student_model = efficientnet_v2_s(weights=EfficientNet_V2_S_Weights.IMAGENET1K_V1)
    student_model.classifier[1] = nn.Linear(student_model.classifier[1].in_features, NUM_CLASSES)
    student_model = student_model.to(device)
    
    # Optimizer & Scheduler
    opt_s = optim.AdamW(student_model.parameters(), lr=LEARNING_RATE, weight_decay=WEIGHT_DECAY)
    sch_s = optim.lr_scheduler.CosineAnnealingLR(opt_s, T_max=NUM_EPOCHS - WARMUP_EPOCHS)
    
    # Run DKD Training with tuned beta
    trained_student, distilled_history = train_model_v3(
        model=student_model,
        dataloader=trainloader,
        optimizer=opt_s,
        scheduler=sch_s,
        num_epochs=NUM_EPOCHS,
        model_name=STUDENT_NAME,
        teacher_model=teacher_model,
        use_dkd=True,  # Enable DKD loss
        dkd_alpha=DKD_ALPHA,
        dkd_beta=DKD_BETA,  # Now 2.0 instead of 8.0
        temperature=TEMPERATURE,
        label_smoothing=LABEL_SMOOTHING,
        patience=PATIENCE,
        warmup_epochs=WARMUP_EPOCHS
    )
    
    distilled_accuracy, _ = evaluate_model_with_loss(trained_student, testloader, nn.CrossEntropyLoss())
    print(f"\nStudent v3.1 (DKD β=2.0) Final Accuracy: {distilled_accuracy:.2f}%")


STUDENT MODEL with DKD (Version 3.1 - Tuned Beta)

Starting Decoupled Knowledge Distillation (DKD) v3.1...
  Key Change: DKD Beta REDUCED from 8.0 to 2.0
  DKD Alpha (TCKD weight): 1.0
  DKD Beta (NCKD weight): 2.0 ← REDUCED to fix over-regularization
  Temperature: 4.0
  Label Smoothing: 0.1
  LR Warmup: 5 epochs
  Augmentation: AutoAugment + RandomErasing + CutMix/Mixup
  Starting fresh training...
  Using DKD loss with Teacher guidance


Epoch 1/200: 100%|██████████| 390/390 [01:18<00:00,  4.99it/s, loss=1.317]


Epoch 1/200 | Train Loss: 2.0525 | Val Loss: 3.4177 | Val Acc: 25.66% | LR: 0.000200
  ★ New best model saved! Accuracy: 25.66%


Epoch 2/200: 100%|██████████| 390/390 [01:18<00:00,  4.94it/s, loss=1.340]


Epoch 2/200 | Train Loss: 1.8003 | Val Loss: 2.7542 | Val Acc: 39.99% | LR: 0.000400
  ★ New best model saved! Accuracy: 39.99%


Epoch 3/200: 100%|██████████| 390/390 [01:18<00:00,  4.94it/s, loss=3.615]


Epoch 3/200 | Train Loss: 1.6594 | Val Loss: 2.3862 | Val Acc: 46.53% | LR: 0.000600
  ★ New best model saved! Accuracy: 46.53%


Epoch 4/200: 100%|██████████| 390/390 [01:19<00:00,  4.90it/s, loss=1.017]


Epoch 4/200 | Train Loss: 1.6369 | Val Loss: 2.3853 | Val Acc: 46.39% | LR: 0.000800


Epoch 5/200: 100%|██████████| 390/390 [01:18<00:00,  4.94it/s, loss=1.168]


Epoch 5/200 | Train Loss: 1.6736 | Val Loss: 2.4945 | Val Acc: 45.99% | LR: 0.001000


Epoch 6/200: 100%|██████████| 390/390 [01:19<00:00,  4.89it/s, loss=3.043]


Epoch 6/200 | Train Loss: 1.6888 | Val Loss: 2.2593 | Val Acc: 48.10% | LR: 0.001000
  ★ New best model saved! Accuracy: 48.10%


Epoch 7/200: 100%|██████████| 390/390 [01:19<00:00,  4.90it/s, loss=1.057]


Epoch 7/200 | Train Loss: 1.6835 | Val Loss: 2.3503 | Val Acc: 48.38% | LR: 0.001000
  ★ New best model saved! Accuracy: 48.38%


Epoch 8/200: 100%|██████████| 390/390 [01:19<00:00,  4.93it/s, loss=1.126]


Epoch 8/200 | Train Loss: 1.5727 | Val Loss: 2.2474 | Val Acc: 50.17% | LR: 0.000999
  ★ New best model saved! Accuracy: 50.17%


Epoch 9/200: 100%|██████████| 390/390 [01:19<00:00,  4.91it/s, loss=1.057]


Epoch 9/200 | Train Loss: 1.5719 | Val Loss: 2.3025 | Val Acc: 49.70% | LR: 0.000999


Epoch 10/200: 100%|██████████| 390/390 [01:19<00:00,  4.92it/s, loss=1.029]


Epoch 10/200 | Train Loss: 1.6490 | Val Loss: 2.2031 | Val Acc: 49.64% | LR: 0.000998


Epoch 11/200: 100%|██████████| 390/390 [01:19<00:00,  4.90it/s, loss=1.079]


Epoch 11/200 | Train Loss: 1.5557 | Val Loss: 2.1657 | Val Acc: 53.14% | LR: 0.000998
  ★ New best model saved! Accuracy: 53.14%


Epoch 12/200: 100%|██████████| 390/390 [01:19<00:00,  4.93it/s, loss=1.034]


Epoch 12/200 | Train Loss: 1.4955 | Val Loss: 2.0878 | Val Acc: 52.59% | LR: 0.000997


Epoch 13/200: 100%|██████████| 390/390 [01:19<00:00,  4.92it/s, loss=0.935]


Epoch 13/200 | Train Loss: 1.4254 | Val Loss: 2.2182 | Val Acc: 52.01% | LR: 0.000996


Epoch 14/200: 100%|██████████| 390/390 [01:18<00:00,  4.94it/s, loss=1.074]


Epoch 14/200 | Train Loss: 1.5974 | Val Loss: 2.1129 | Val Acc: 53.23% | LR: 0.000995
  ★ New best model saved! Accuracy: 53.23%


Epoch 15/200: 100%|██████████| 390/390 [01:19<00:00,  4.93it/s, loss=0.719]


Epoch 15/200 | Train Loss: 1.5368 | Val Loss: 2.1317 | Val Acc: 53.67% | LR: 0.000994
  ★ New best model saved! Accuracy: 53.67%


Epoch 16/200: 100%|██████████| 390/390 [01:18<00:00,  4.99it/s, loss=0.997]


Epoch 16/200 | Train Loss: 1.4728 | Val Loss: 1.9663 | Val Acc: 54.39% | LR: 0.000992
  ★ New best model saved! Accuracy: 54.39%


Epoch 17/200: 100%|██████████| 390/390 [01:19<00:00,  4.92it/s, loss=3.353]


Epoch 17/200 | Train Loss: 1.3850 | Val Loss: 2.0201 | Val Acc: 54.87% | LR: 0.000991
  ★ New best model saved! Accuracy: 54.87%


Epoch 18/200: 100%|██████████| 390/390 [01:18<00:00,  4.95it/s, loss=1.035]


Epoch 18/200 | Train Loss: 1.4523 | Val Loss: 1.9863 | Val Acc: 56.01% | LR: 0.000989
  ★ New best model saved! Accuracy: 56.01%


Epoch 19/200: 100%|██████████| 390/390 [01:19<00:00,  4.92it/s, loss=0.948]


Epoch 19/200 | Train Loss: 1.4569 | Val Loss: 1.9722 | Val Acc: 56.36% | LR: 0.000987
  ★ New best model saved! Accuracy: 56.36%


Epoch 20/200: 100%|██████████| 390/390 [01:19<00:00,  4.94it/s, loss=3.116]


Epoch 20/200 | Train Loss: 1.5045 | Val Loss: 2.0370 | Val Acc: 55.45% | LR: 0.000985
  Checkpoint saved: outputs\checkpoints\student_v3_1_dkd_beta2_epoch20.pth


Epoch 21/200: 100%|██████████| 390/390 [01:18<00:00,  4.96it/s, loss=1.793]


Epoch 21/200 | Train Loss: 1.4191 | Val Loss: 1.9361 | Val Acc: 57.10% | LR: 0.000983
  ★ New best model saved! Accuracy: 57.10%


Epoch 22/200: 100%|██████████| 390/390 [01:19<00:00,  4.92it/s, loss=0.936]


Epoch 22/200 | Train Loss: 1.4288 | Val Loss: 2.0344 | Val Acc: 57.89% | LR: 0.000981
  ★ New best model saved! Accuracy: 57.89%


Epoch 23/200: 100%|██████████| 390/390 [01:18<00:00,  4.96it/s, loss=0.933]


Epoch 23/200 | Train Loss: 1.4974 | Val Loss: 1.9226 | Val Acc: 56.86% | LR: 0.000979


Epoch 24/200: 100%|██████████| 390/390 [01:18<00:00,  4.95it/s, loss=0.781]


Epoch 24/200 | Train Loss: 1.4637 | Val Loss: 1.9384 | Val Acc: 58.38% | LR: 0.000977
  ★ New best model saved! Accuracy: 58.38%


Epoch 25/200: 100%|██████████| 390/390 [01:18<00:00,  4.99it/s, loss=0.882]


Epoch 25/200 | Train Loss: 1.3645 | Val Loss: 2.1787 | Val Acc: 57.48% | LR: 0.000974


Epoch 26/200: 100%|██████████| 390/390 [01:16<00:00,  5.08it/s, loss=1.721]


Epoch 26/200 | Train Loss: 1.4195 | Val Loss: 1.9112 | Val Acc: 57.86% | LR: 0.000972


Epoch 27/200: 100%|██████████| 390/390 [01:16<00:00,  5.11it/s, loss=1.827]


Epoch 27/200 | Train Loss: 1.4244 | Val Loss: 1.8281 | Val Acc: 59.89% | LR: 0.000969
  ★ New best model saved! Accuracy: 59.89%


Epoch 28/200: 100%|██████████| 390/390 [01:16<00:00,  5.11it/s, loss=0.964]


Epoch 28/200 | Train Loss: 1.3793 | Val Loss: 1.7975 | Val Acc: 62.03% | LR: 0.000966
  ★ New best model saved! Accuracy: 62.03%


Epoch 29/200: 100%|██████████| 390/390 [01:17<00:00,  5.04it/s, loss=0.837]


Epoch 29/200 | Train Loss: 1.4469 | Val Loss: 1.7815 | Val Acc: 60.74% | LR: 0.000963


Epoch 30/200: 100%|██████████| 390/390 [01:13<00:00,  5.29it/s, loss=3.117]


Epoch 30/200 | Train Loss: 1.4271 | Val Loss: 1.8985 | Val Acc: 58.96% | LR: 0.000960


Epoch 31/200: 100%|██████████| 390/390 [01:13<00:00,  5.29it/s, loss=0.788]


Epoch 31/200 | Train Loss: 1.4162 | Val Loss: 1.7714 | Val Acc: 60.47% | LR: 0.000957


Epoch 32/200: 100%|██████████| 390/390 [01:13<00:00,  5.31it/s, loss=0.822]


Epoch 32/200 | Train Loss: 1.4662 | Val Loss: 1.8845 | Val Acc: 59.39% | LR: 0.000953


Epoch 33/200: 100%|██████████| 390/390 [01:13<00:00,  5.31it/s, loss=1.703]


Epoch 33/200 | Train Loss: 1.3888 | Val Loss: 1.8268 | Val Acc: 60.36% | LR: 0.000950


Epoch 34/200: 100%|██████████| 390/390 [01:13<00:00,  5.28it/s, loss=0.741]


Epoch 34/200 | Train Loss: 1.3918 | Val Loss: 1.7756 | Val Acc: 60.76% | LR: 0.000946


Epoch 35/200: 100%|██████████| 390/390 [01:13<00:00,  5.32it/s, loss=2.324]


Epoch 35/200 | Train Loss: 1.3837 | Val Loss: 1.7707 | Val Acc: 60.47% | LR: 0.000943


Epoch 36/200: 100%|██████████| 390/390 [01:13<00:00,  5.31it/s, loss=0.878]


Epoch 36/200 | Train Loss: 1.3809 | Val Loss: 1.8257 | Val Acc: 61.40% | LR: 0.000939


Epoch 37/200: 100%|██████████| 390/390 [01:13<00:00,  5.34it/s, loss=0.890]


Epoch 37/200 | Train Loss: 1.4053 | Val Loss: 1.8161 | Val Acc: 60.80% | LR: 0.000935


Epoch 38/200: 100%|██████████| 390/390 [01:13<00:00,  5.31it/s, loss=1.742]


Epoch 38/200 | Train Loss: 1.3306 | Val Loss: 1.7031 | Val Acc: 62.19% | LR: 0.000931
  ★ New best model saved! Accuracy: 62.19%


Epoch 39/200: 100%|██████████| 390/390 [01:13<00:00,  5.34it/s, loss=0.975]


Epoch 39/200 | Train Loss: 1.3258 | Val Loss: 1.7703 | Val Acc: 61.09% | LR: 0.000927


Epoch 40/200: 100%|██████████| 390/390 [01:14<00:00,  5.21it/s, loss=0.827]


Epoch 40/200 | Train Loss: 1.3366 | Val Loss: 1.7632 | Val Acc: 61.73% | LR: 0.000923
  Checkpoint saved: outputs\checkpoints\student_v3_1_dkd_beta2_epoch40.pth


Epoch 41/200: 100%|██████████| 390/390 [01:13<00:00,  5.32it/s, loss=3.202]


Epoch 41/200 | Train Loss: 1.3688 | Val Loss: 1.7715 | Val Acc: 61.42% | LR: 0.000918


Epoch 42/200: 100%|██████████| 390/390 [01:13<00:00,  5.32it/s, loss=0.794]


Epoch 42/200 | Train Loss: 1.3433 | Val Loss: 1.6948 | Val Acc: 61.91% | LR: 0.000914


Epoch 43/200: 100%|██████████| 390/390 [01:13<00:00,  5.33it/s, loss=3.005]


Epoch 43/200 | Train Loss: 1.4086 | Val Loss: 1.7199 | Val Acc: 63.23% | LR: 0.000909
  ★ New best model saved! Accuracy: 63.23%


Epoch 44/200: 100%|██████████| 390/390 [01:13<00:00,  5.33it/s, loss=2.652]


Epoch 44/200 | Train Loss: 1.3213 | Val Loss: 1.6941 | Val Acc: 63.50% | LR: 0.000905
  ★ New best model saved! Accuracy: 63.50%


Epoch 45/200: 100%|██████████| 390/390 [01:13<00:00,  5.34it/s, loss=0.633]


Epoch 45/200 | Train Loss: 1.3557 | Val Loss: 1.6997 | Val Acc: 62.26% | LR: 0.000900


Epoch 46/200: 100%|██████████| 390/390 [01:13<00:00,  5.32it/s, loss=0.797]


Epoch 46/200 | Train Loss: 1.4308 | Val Loss: 1.6911 | Val Acc: 63.29% | LR: 0.000895


Epoch 47/200: 100%|██████████| 390/390 [01:12<00:00,  5.35it/s, loss=2.956]


Epoch 47/200 | Train Loss: 1.3159 | Val Loss: 1.7132 | Val Acc: 62.20% | LR: 0.000890


Epoch 48/200: 100%|██████████| 390/390 [01:13<00:00,  5.32it/s, loss=0.903]


Epoch 48/200 | Train Loss: 1.3724 | Val Loss: 1.6950 | Val Acc: 62.96% | LR: 0.000885


Epoch 49/200: 100%|██████████| 390/390 [01:12<00:00,  5.36it/s, loss=0.889]


Epoch 49/200 | Train Loss: 1.3886 | Val Loss: 1.7178 | Val Acc: 62.53% | LR: 0.000880


Epoch 50/200: 100%|██████████| 390/390 [01:13<00:00,  5.30it/s, loss=0.913]


Epoch 50/200 | Train Loss: 1.3836 | Val Loss: 1.6891 | Val Acc: 63.16% | LR: 0.000874


Epoch 51/200: 100%|██████████| 390/390 [01:13<00:00,  5.33it/s, loss=0.794]


Epoch 51/200 | Train Loss: 1.3390 | Val Loss: 1.6856 | Val Acc: 64.03% | LR: 0.000869
  ★ New best model saved! Accuracy: 64.03%


Epoch 52/200: 100%|██████████| 390/390 [01:12<00:00,  5.36it/s, loss=0.811]


Epoch 52/200 | Train Loss: 1.3437 | Val Loss: 1.6720 | Val Acc: 63.06% | LR: 0.000863


Epoch 53/200: 100%|██████████| 390/390 [01:13<00:00,  5.33it/s, loss=0.896]


Epoch 53/200 | Train Loss: 1.3730 | Val Loss: 1.6619 | Val Acc: 64.50% | LR: 0.000858
  ★ New best model saved! Accuracy: 64.50%


Epoch 54/200: 100%|██████████| 390/390 [01:14<00:00,  5.26it/s, loss=3.116]


Epoch 54/200 | Train Loss: 1.2570 | Val Loss: 1.6816 | Val Acc: 63.88% | LR: 0.000852


Epoch 55/200: 100%|██████████| 390/390 [01:13<00:00,  5.32it/s, loss=2.025]


Epoch 55/200 | Train Loss: 1.3388 | Val Loss: 1.6408 | Val Acc: 64.94% | LR: 0.000846
  ★ New best model saved! Accuracy: 64.94%


Epoch 56/200: 100%|██████████| 390/390 [01:13<00:00,  5.29it/s, loss=0.874]


Epoch 56/200 | Train Loss: 1.3244 | Val Loss: 1.7715 | Val Acc: 62.81% | LR: 0.000841


Epoch 57/200: 100%|██████████| 390/390 [01:13<00:00,  5.32it/s, loss=0.836]


Epoch 57/200 | Train Loss: 1.2430 | Val Loss: 1.6830 | Val Acc: 63.45% | LR: 0.000835


Epoch 58/200: 100%|██████████| 390/390 [01:13<00:00,  5.33it/s, loss=0.683]


Epoch 58/200 | Train Loss: 1.3268 | Val Loss: 1.6249 | Val Acc: 64.75% | LR: 0.000829


Epoch 59/200: 100%|██████████| 390/390 [01:13<00:00,  5.32it/s, loss=0.869]


Epoch 59/200 | Train Loss: 1.4009 | Val Loss: 1.7031 | Val Acc: 63.47% | LR: 0.000822


Epoch 60/200: 100%|██████████| 390/390 [01:13<00:00,  5.30it/s, loss=0.746]


Epoch 60/200 | Train Loss: 1.3311 | Val Loss: 1.5981 | Val Acc: 64.90% | LR: 0.000816
  Checkpoint saved: outputs\checkpoints\student_v3_1_dkd_beta2_epoch60.pth


Epoch 61/200: 100%|██████████| 390/390 [01:13<00:00,  5.28it/s, loss=0.922]


Epoch 61/200 | Train Loss: 1.3024 | Val Loss: 1.6325 | Val Acc: 64.00% | LR: 0.000810


Epoch 62/200: 100%|██████████| 390/390 [01:13<00:00,  5.30it/s, loss=1.033]


Epoch 62/200 | Train Loss: 1.3237 | Val Loss: 1.6703 | Val Acc: 64.27% | LR: 0.000804


Epoch 63/200: 100%|██████████| 390/390 [01:12<00:00,  5.37it/s, loss=0.742]


Epoch 63/200 | Train Loss: 1.3131 | Val Loss: 1.6312 | Val Acc: 64.27% | LR: 0.000797


Epoch 64/200: 100%|██████████| 390/390 [01:13<00:00,  5.32it/s, loss=0.824]


Epoch 64/200 | Train Loss: 1.3628 | Val Loss: 1.6519 | Val Acc: 64.77% | LR: 0.000791


Epoch 65/200: 100%|██████████| 390/390 [01:13<00:00,  5.30it/s, loss=0.647]


Epoch 65/200 | Train Loss: 1.2973 | Val Loss: 1.7233 | Val Acc: 64.45% | LR: 0.000784


Epoch 66/200: 100%|██████████| 390/390 [01:13<00:00,  5.32it/s, loss=0.847]


Epoch 66/200 | Train Loss: 1.2869 | Val Loss: 1.6619 | Val Acc: 65.04% | LR: 0.000777
  ★ New best model saved! Accuracy: 65.04%


Epoch 67/200: 100%|██████████| 390/390 [01:13<00:00,  5.29it/s, loss=0.951]


Epoch 67/200 | Train Loss: 1.3881 | Val Loss: 1.6216 | Val Acc: 63.91% | LR: 0.000771


Epoch 68/200: 100%|██████████| 390/390 [01:13<00:00,  5.34it/s, loss=0.901]


Epoch 68/200 | Train Loss: 1.3028 | Val Loss: 1.5865 | Val Acc: 64.74% | LR: 0.000764


Epoch 69/200: 100%|██████████| 390/390 [01:13<00:00,  5.33it/s, loss=2.449]


Epoch 69/200 | Train Loss: 1.2936 | Val Loss: 1.5679 | Val Acc: 66.39% | LR: 0.000757
  ★ New best model saved! Accuracy: 66.39%


Epoch 70/200: 100%|██████████| 390/390 [01:13<00:00,  5.31it/s, loss=0.813]


Epoch 70/200 | Train Loss: 1.2464 | Val Loss: 1.6925 | Val Acc: 63.87% | LR: 0.000750


Epoch 71/200: 100%|██████████| 390/390 [01:13<00:00,  5.32it/s, loss=0.873]


Epoch 71/200 | Train Loss: 1.2653 | Val Loss: 1.6297 | Val Acc: 66.08% | LR: 0.000743


Epoch 72/200: 100%|██████████| 390/390 [01:13<00:00,  5.33it/s, loss=2.366]


Epoch 72/200 | Train Loss: 1.2371 | Val Loss: 1.5885 | Val Acc: 65.52% | LR: 0.000736


Epoch 73/200: 100%|██████████| 390/390 [01:13<00:00,  5.28it/s, loss=0.628]


Epoch 73/200 | Train Loss: 1.3613 | Val Loss: 1.5616 | Val Acc: 65.60% | LR: 0.000729


Epoch 74/200: 100%|██████████| 390/390 [01:13<00:00,  5.31it/s, loss=0.524]


Epoch 74/200 | Train Loss: 1.2502 | Val Loss: 1.6018 | Val Acc: 65.81% | LR: 0.000722


Epoch 75/200: 100%|██████████| 390/390 [01:13<00:00,  5.29it/s, loss=3.079]


Epoch 75/200 | Train Loss: 1.3654 | Val Loss: 1.5904 | Val Acc: 65.15% | LR: 0.000714


Epoch 76/200: 100%|██████████| 390/390 [01:12<00:00,  5.34it/s, loss=0.898]


Epoch 76/200 | Train Loss: 1.3027 | Val Loss: 1.5923 | Val Acc: 65.53% | LR: 0.000707


Epoch 77/200: 100%|██████████| 390/390 [01:13<00:00,  5.30it/s, loss=0.804]


Epoch 77/200 | Train Loss: 1.2179 | Val Loss: 1.6792 | Val Acc: 65.21% | LR: 0.000700


Epoch 78/200: 100%|██████████| 390/390 [01:13<00:00,  5.31it/s, loss=0.797]


Epoch 78/200 | Train Loss: 1.3075 | Val Loss: 1.6243 | Val Acc: 65.29% | LR: 0.000692


Epoch 79/200: 100%|██████████| 390/390 [01:12<00:00,  5.35it/s, loss=0.852]


Epoch 79/200 | Train Loss: 1.3165 | Val Loss: 1.5411 | Val Acc: 66.39% | LR: 0.000685


Epoch 80/200: 100%|██████████| 390/390 [01:13<00:00,  5.31it/s, loss=0.903]


Epoch 80/200 | Train Loss: 1.2246 | Val Loss: 1.5963 | Val Acc: 65.41% | LR: 0.000677
  Checkpoint saved: outputs\checkpoints\student_v3_1_dkd_beta2_epoch80.pth
  Cleaned up: student_v3_1_dkd_beta2_epoch20.pth


Epoch 81/200: 100%|██████████| 390/390 [01:13<00:00,  5.32it/s, loss=0.832]


Epoch 81/200 | Train Loss: 1.2729 | Val Loss: 1.5618 | Val Acc: 65.63% | LR: 0.000670


Epoch 82/200: 100%|██████████| 390/390 [01:13<00:00,  5.33it/s, loss=1.494]


Epoch 82/200 | Train Loss: 1.3312 | Val Loss: 1.6459 | Val Acc: 64.95% | LR: 0.000662


Epoch 83/200: 100%|██████████| 390/390 [01:13<00:00,  5.31it/s, loss=1.619]


Epoch 83/200 | Train Loss: 1.3456 | Val Loss: 1.5811 | Val Acc: 65.86% | LR: 0.000655


Epoch 84/200: 100%|██████████| 390/390 [01:13<00:00,  5.31it/s, loss=0.766]


Epoch 84/200 | Train Loss: 1.2525 | Val Loss: 1.5868 | Val Acc: 66.23% | LR: 0.000647


Epoch 85/200: 100%|██████████| 390/390 [01:12<00:00,  5.36it/s, loss=0.591]


Epoch 85/200 | Train Loss: 1.2541 | Val Loss: 1.5574 | Val Acc: 66.69% | LR: 0.000639
  ★ New best model saved! Accuracy: 66.69%


Epoch 86/200: 100%|██████████| 390/390 [01:13<00:00,  5.34it/s, loss=0.829]


Epoch 86/200 | Train Loss: 1.2614 | Val Loss: 1.5510 | Val Acc: 65.38% | LR: 0.000631


Epoch 87/200: 100%|██████████| 390/390 [01:13<00:00,  5.32it/s, loss=0.749]


Epoch 87/200 | Train Loss: 1.2896 | Val Loss: 1.5880 | Val Acc: 65.93% | LR: 0.000624


Epoch 88/200: 100%|██████████| 390/390 [01:13<00:00,  5.31it/s, loss=0.617]


Epoch 88/200 | Train Loss: 1.2950 | Val Loss: 1.5167 | Val Acc: 66.68% | LR: 0.000616


Epoch 89/200: 100%|██████████| 390/390 [01:13<00:00,  5.34it/s, loss=0.789]


Epoch 89/200 | Train Loss: 1.2372 | Val Loss: 1.5166 | Val Acc: 67.50% | LR: 0.000608
  ★ New best model saved! Accuracy: 67.50%


Epoch 90/200: 100%|██████████| 390/390 [01:13<00:00,  5.32it/s, loss=0.801]


Epoch 90/200 | Train Loss: 1.2650 | Val Loss: 1.5462 | Val Acc: 67.32% | LR: 0.000600


Epoch 91/200: 100%|██████████| 390/390 [01:13<00:00,  5.34it/s, loss=2.538]


Epoch 91/200 | Train Loss: 1.2945 | Val Loss: 1.5384 | Val Acc: 66.39% | LR: 0.000592


Epoch 92/200: 100%|██████████| 390/390 [01:12<00:00,  5.36it/s, loss=2.556]


Epoch 92/200 | Train Loss: 1.1503 | Val Loss: 1.4519 | Val Acc: 67.40% | LR: 0.000584


Epoch 93/200: 100%|██████████| 390/390 [01:13<00:00,  5.33it/s, loss=0.845]


Epoch 93/200 | Train Loss: 1.2169 | Val Loss: 1.5400 | Val Acc: 67.57% | LR: 0.000576
  ★ New best model saved! Accuracy: 67.57%


Epoch 94/200: 100%|██████████| 390/390 [01:13<00:00,  5.30it/s, loss=0.596]


Epoch 94/200 | Train Loss: 1.2370 | Val Loss: 1.5318 | Val Acc: 67.15% | LR: 0.000568


Epoch 95/200: 100%|██████████| 390/390 [01:13<00:00,  5.31it/s, loss=0.619]


Epoch 95/200 | Train Loss: 1.2579 | Val Loss: 1.6818 | Val Acc: 65.52% | LR: 0.000560


Epoch 96/200: 100%|██████████| 390/390 [01:13<00:00,  5.33it/s, loss=1.604]


Epoch 96/200 | Train Loss: 1.2732 | Val Loss: 1.4777 | Val Acc: 68.31% | LR: 0.000552
  ★ New best model saved! Accuracy: 68.31%


Epoch 97/200: 100%|██████████| 390/390 [01:13<00:00,  5.28it/s, loss=0.658]


Epoch 97/200 | Train Loss: 1.3151 | Val Loss: 1.7384 | Val Acc: 65.78% | LR: 0.000544


Epoch 98/200: 100%|██████████| 390/390 [01:12<00:00,  5.34it/s, loss=0.703]


Epoch 98/200 | Train Loss: 1.1956 | Val Loss: 1.7078 | Val Acc: 66.67% | LR: 0.000536


Epoch 99/200: 100%|██████████| 390/390 [01:13<00:00,  5.29it/s, loss=2.210]


Epoch 99/200 | Train Loss: 1.2681 | Val Loss: 1.5371 | Val Acc: 67.53% | LR: 0.000528


Epoch 100/200: 100%|██████████| 390/390 [01:13<00:00,  5.31it/s, loss=2.918]


Epoch 100/200 | Train Loss: 1.2645 | Val Loss: 1.5541 | Val Acc: 66.31% | LR: 0.000520
  Checkpoint saved: outputs\checkpoints\student_v3_1_dkd_beta2_epoch100.pth
  Cleaned up: student_v3_1_dkd_beta2_epoch100.pth


Epoch 101/200: 100%|██████████| 390/390 [01:13<00:00,  5.32it/s, loss=0.802]


Epoch 101/200 | Train Loss: 1.2527 | Val Loss: 1.9906 | Val Acc: 66.98% | LR: 0.000512


Epoch 102/200: 100%|██████████| 390/390 [01:13<00:00,  5.31it/s, loss=2.653]


Epoch 102/200 | Train Loss: 1.2873 | Val Loss: 1.8024 | Val Acc: 65.46% | LR: 0.000504


Epoch 103/200: 100%|██████████| 390/390 [01:13<00:00,  5.32it/s, loss=1.344]


Epoch 103/200 | Train Loss: 1.2829 | Val Loss: 1.7623 | Val Acc: 66.54% | LR: 0.000496


Epoch 104/200: 100%|██████████| 390/390 [01:13<00:00,  5.29it/s, loss=0.794]


Epoch 104/200 | Train Loss: 1.2095 | Val Loss: 1.5579 | Val Acc: 66.82% | LR: 0.000488


Epoch 105/200: 100%|██████████| 390/390 [01:13<00:00,  5.31it/s, loss=0.790]


Epoch 105/200 | Train Loss: 1.2337 | Val Loss: 1.6308 | Val Acc: 67.41% | LR: 0.000480


Epoch 106/200: 100%|██████████| 390/390 [01:13<00:00,  5.32it/s, loss=0.803]


Epoch 106/200 | Train Loss: 1.2193 | Val Loss: 1.6777 | Val Acc: 66.39% | LR: 0.000472


Epoch 107/200: 100%|██████████| 390/390 [01:13<00:00,  5.31it/s, loss=0.816]


Epoch 107/200 | Train Loss: 1.2457 | Val Loss: 1.5021 | Val Acc: 68.54% | LR: 0.000464
  ★ New best model saved! Accuracy: 68.54%


Epoch 108/200: 100%|██████████| 390/390 [01:13<00:00,  5.31it/s, loss=0.761]


Epoch 108/200 | Train Loss: 1.2417 | Val Loss: 2.2689 | Val Acc: 67.24% | LR: 0.000456


Epoch 109/200: 100%|██████████| 390/390 [01:13<00:00,  5.34it/s, loss=0.530]


Epoch 109/200 | Train Loss: 1.2237 | Val Loss: 1.5778 | Val Acc: 67.74% | LR: 0.000448


Epoch 110/200: 100%|██████████| 390/390 [01:13<00:00,  5.30it/s, loss=2.363]


Epoch 110/200 | Train Loss: 1.1484 | Val Loss: 1.5059 | Val Acc: 67.90% | LR: 0.000440


Epoch 111/200: 100%|██████████| 390/390 [01:13<00:00,  5.32it/s, loss=0.761]


Epoch 111/200 | Train Loss: 1.1333 | Val Loss: 1.5782 | Val Acc: 68.82% | LR: 0.000432
  ★ New best model saved! Accuracy: 68.82%


Epoch 112/200: 100%|██████████| 390/390 [01:12<00:00,  5.36it/s, loss=1.427]


Epoch 112/200 | Train Loss: 1.1883 | Val Loss: 2.5452 | Val Acc: 67.25% | LR: 0.000424


Epoch 113/200: 100%|██████████| 390/390 [01:13<00:00,  5.31it/s, loss=3.492]


Epoch 113/200 | Train Loss: 1.2059 | Val Loss: 2.1887 | Val Acc: 67.35% | LR: 0.000416


Epoch 114/200: 100%|██████████| 390/390 [01:14<00:00,  5.25it/s, loss=0.638]


Epoch 114/200 | Train Loss: 1.2091 | Val Loss: 1.9560 | Val Acc: 69.37% | LR: 0.000408
  ★ New best model saved! Accuracy: 69.37%


Epoch 115/200: 100%|██████████| 390/390 [01:13<00:00,  5.30it/s, loss=0.747]


Epoch 115/200 | Train Loss: 1.1787 | Val Loss: 2.6276 | Val Acc: 68.71% | LR: 0.000400


Epoch 116/200: 100%|██████████| 390/390 [01:13<00:00,  5.33it/s, loss=0.508]


Epoch 116/200 | Train Loss: 1.1702 | Val Loss: 1.5897 | Val Acc: 68.75% | LR: 0.000392


Epoch 117/200: 100%|██████████| 390/390 [01:13<00:00,  5.27it/s, loss=2.226]


Epoch 117/200 | Train Loss: 1.1738 | Val Loss: 1.3981 | Val Acc: 69.35% | LR: 0.000384


Epoch 118/200: 100%|██████████| 390/390 [01:12<00:00,  5.35it/s, loss=0.640]


Epoch 118/200 | Train Loss: 1.2094 | Val Loss: 1.6214 | Val Acc: 69.46% | LR: 0.000376
  ★ New best model saved! Accuracy: 69.46%


Epoch 119/200: 100%|██████████| 390/390 [01:13<00:00,  5.29it/s, loss=0.700]


Epoch 119/200 | Train Loss: 1.1960 | Val Loss: 1.5460 | Val Acc: 69.12% | LR: 0.000369


Epoch 120/200: 100%|██████████| 390/390 [01:13<00:00,  5.31it/s, loss=0.816]


Epoch 120/200 | Train Loss: 1.1653 | Val Loss: 1.8231 | Val Acc: 68.34% | LR: 0.000361
  Checkpoint saved: outputs\checkpoints\student_v3_1_dkd_beta2_epoch120.pth
  Cleaned up: student_v3_1_dkd_beta2_epoch120.pth


Epoch 121/200: 100%|██████████| 390/390 [01:13<00:00,  5.29it/s, loss=0.490]


Epoch 121/200 | Train Loss: 1.2463 | Val Loss: 1.4859 | Val Acc: 68.71% | LR: 0.000353


Epoch 122/200: 100%|██████████| 390/390 [01:13<00:00,  5.31it/s, loss=0.814]


Epoch 122/200 | Train Loss: 1.2002 | Val Loss: 1.5196 | Val Acc: 69.07% | LR: 0.000345


Epoch 123/200: 100%|██████████| 390/390 [01:13<00:00,  5.30it/s, loss=0.720]


Epoch 123/200 | Train Loss: 1.1487 | Val Loss: 2.5122 | Val Acc: 68.66% | LR: 0.000338


Epoch 124/200: 100%|██████████| 390/390 [01:13<00:00,  5.27it/s, loss=0.668]


Epoch 124/200 | Train Loss: 1.2188 | Val Loss: 1.4477 | Val Acc: 69.97% | LR: 0.000330
  ★ New best model saved! Accuracy: 69.97%


Epoch 125/200: 100%|██████████| 390/390 [01:13<00:00,  5.30it/s, loss=0.711]


Epoch 125/200 | Train Loss: 1.1168 | Val Loss: 1.5111 | Val Acc: 69.31% | LR: 0.000323


Epoch 126/200: 100%|██████████| 390/390 [01:13<00:00,  5.31it/s, loss=0.766]


Epoch 126/200 | Train Loss: 1.1469 | Val Loss: 1.5291 | Val Acc: 68.94% | LR: 0.000315


Epoch 127/200: 100%|██████████| 390/390 [01:13<00:00,  5.30it/s, loss=1.979]


Epoch 127/200 | Train Loss: 1.2199 | Val Loss: 1.3985 | Val Acc: 70.13% | LR: 0.000308
  ★ New best model saved! Accuracy: 70.13%


Epoch 128/200: 100%|██████████| 390/390 [01:14<00:00,  5.27it/s, loss=0.709]


Epoch 128/200 | Train Loss: 1.1360 | Val Loss: 1.3872 | Val Acc: 70.50% | LR: 0.000300
  ★ New best model saved! Accuracy: 70.50%


Epoch 129/200: 100%|██████████| 390/390 [01:13<00:00,  5.32it/s, loss=0.762]


Epoch 129/200 | Train Loss: 1.1528 | Val Loss: 1.4230 | Val Acc: 70.27% | LR: 0.000293


Epoch 130/200: 100%|██████████| 390/390 [01:13<00:00,  5.30it/s, loss=0.614]


Epoch 130/200 | Train Loss: 1.1641 | Val Loss: 1.4337 | Val Acc: 71.09% | LR: 0.000286
  ★ New best model saved! Accuracy: 71.09%


Epoch 131/200: 100%|██████████| 390/390 [01:13<00:00,  5.30it/s, loss=2.365]


Epoch 131/200 | Train Loss: 1.1608 | Val Loss: 1.4394 | Val Acc: 70.16% | LR: 0.000278


Epoch 132/200: 100%|██████████| 390/390 [01:13<00:00,  5.28it/s, loss=0.607]


Epoch 132/200 | Train Loss: 1.1391 | Val Loss: 1.3988 | Val Acc: 70.63% | LR: 0.000271


Epoch 133/200: 100%|██████████| 390/390 [01:12<00:00,  5.34it/s, loss=1.949]


Epoch 133/200 | Train Loss: 1.1780 | Val Loss: 1.3668 | Val Acc: 71.11% | LR: 0.000264
  ★ New best model saved! Accuracy: 71.11%


Epoch 134/200: 100%|██████████| 390/390 [01:13<00:00,  5.32it/s, loss=0.792]


Epoch 134/200 | Train Loss: 1.1192 | Val Loss: 1.3953 | Val Acc: 70.81% | LR: 0.000257


Epoch 135/200: 100%|██████████| 390/390 [01:15<00:00,  5.19it/s, loss=0.691]


Epoch 135/200 | Train Loss: 1.1604 | Val Loss: 1.6678 | Val Acc: 70.48% | LR: 0.000250


Epoch 136/200: 100%|██████████| 390/390 [01:13<00:00,  5.28it/s, loss=1.149]


Epoch 136/200 | Train Loss: 1.1073 | Val Loss: 1.4702 | Val Acc: 70.43% | LR: 0.000243


Epoch 137/200: 100%|██████████| 390/390 [01:14<00:00,  5.25it/s, loss=0.626]


Epoch 137/200 | Train Loss: 1.1152 | Val Loss: 1.6495 | Val Acc: 70.09% | LR: 0.000236


Epoch 138/200: 100%|██████████| 390/390 [01:14<00:00,  5.24it/s, loss=0.676]


Epoch 138/200 | Train Loss: 1.1570 | Val Loss: 1.7968 | Val Acc: 70.47% | LR: 0.000229


Epoch 139/200: 100%|██████████| 390/390 [01:14<00:00,  5.23it/s, loss=0.744]


Epoch 139/200 | Train Loss: 1.1445 | Val Loss: 1.4114 | Val Acc: 70.78% | LR: 0.000223


Epoch 140/200: 100%|██████████| 390/390 [01:14<00:00,  5.26it/s, loss=0.651]


Epoch 140/200 | Train Loss: 1.1083 | Val Loss: 1.4549 | Val Acc: 70.99% | LR: 0.000216
  Checkpoint saved: outputs\checkpoints\student_v3_1_dkd_beta2_epoch140.pth
  Cleaned up: student_v3_1_dkd_beta2_epoch140.pth


Epoch 141/200: 100%|██████████| 390/390 [01:14<00:00,  5.22it/s, loss=2.084]


Epoch 141/200 | Train Loss: 1.1818 | Val Loss: 1.3315 | Val Acc: 71.41% | LR: 0.000209
  ★ New best model saved! Accuracy: 71.41%


Epoch 142/200: 100%|██████████| 390/390 [01:13<00:00,  5.28it/s, loss=3.529]


Epoch 142/200 | Train Loss: 1.1090 | Val Loss: 1.4386 | Val Acc: 71.12% | LR: 0.000203


Epoch 143/200: 100%|██████████| 390/390 [01:13<00:00,  5.31it/s, loss=0.581]


Epoch 143/200 | Train Loss: 1.0924 | Val Loss: 1.3979 | Val Acc: 71.93% | LR: 0.000196
  ★ New best model saved! Accuracy: 71.93%


Epoch 144/200: 100%|██████████| 390/390 [01:14<00:00,  5.25it/s, loss=0.714]


Epoch 144/200 | Train Loss: 1.1658 | Val Loss: 1.5252 | Val Acc: 71.41% | LR: 0.000190


Epoch 145/200: 100%|██████████| 390/390 [01:13<00:00,  5.27it/s, loss=1.763]


Epoch 145/200 | Train Loss: 1.1557 | Val Loss: 1.4070 | Val Acc: 71.33% | LR: 0.000184


Epoch 146/200: 100%|██████████| 390/390 [01:14<00:00,  5.24it/s, loss=0.546]


Epoch 146/200 | Train Loss: 1.1801 | Val Loss: 1.4952 | Val Acc: 70.75% | LR: 0.000178


Epoch 147/200: 100%|██████████| 390/390 [01:14<00:00,  5.25it/s, loss=1.387]


Epoch 147/200 | Train Loss: 1.1292 | Val Loss: 1.4213 | Val Acc: 71.42% | LR: 0.000171


Epoch 148/200: 100%|██████████| 390/390 [01:14<00:00,  5.24it/s, loss=0.465]


Epoch 148/200 | Train Loss: 1.0911 | Val Loss: 1.2900 | Val Acc: 72.26% | LR: 0.000165
  ★ New best model saved! Accuracy: 72.26%


Epoch 149/200: 100%|██████████| 390/390 [01:14<00:00,  5.25it/s, loss=0.567]


Epoch 149/200 | Train Loss: 1.1181 | Val Loss: 1.6638 | Val Acc: 71.33% | LR: 0.000159


Epoch 150/200: 100%|██████████| 390/390 [01:14<00:00,  5.21it/s, loss=3.361]


Epoch 150/200 | Train Loss: 1.0836 | Val Loss: 1.4503 | Val Acc: 71.98% | LR: 0.000154


Epoch 151/200: 100%|██████████| 390/390 [01:14<00:00,  5.23it/s, loss=0.823]


Epoch 151/200 | Train Loss: 1.1040 | Val Loss: 2.4681 | Val Acc: 70.25% | LR: 0.000148


Epoch 152/200: 100%|██████████| 390/390 [01:14<00:00,  5.23it/s, loss=0.598]


Epoch 152/200 | Train Loss: 1.0587 | Val Loss: 1.3561 | Val Acc: 72.16% | LR: 0.000142


Epoch 153/200: 100%|██████████| 390/390 [01:14<00:00,  5.20it/s, loss=1.316]


Epoch 153/200 | Train Loss: 1.1071 | Val Loss: 1.6214 | Val Acc: 71.54% | LR: 0.000137


Epoch 154/200: 100%|██████████| 390/390 [01:14<00:00,  5.22it/s, loss=0.753]


Epoch 154/200 | Train Loss: 1.1481 | Val Loss: 1.5115 | Val Acc: 71.71% | LR: 0.000131


Epoch 155/200: 100%|██████████| 390/390 [01:14<00:00,  5.24it/s, loss=0.455]


Epoch 155/200 | Train Loss: 1.0890 | Val Loss: 1.4686 | Val Acc: 72.01% | LR: 0.000126


Epoch 156/200: 100%|██████████| 390/390 [01:13<00:00,  5.27it/s, loss=0.445]


Epoch 156/200 | Train Loss: 1.0990 | Val Loss: 1.3467 | Val Acc: 72.30% | LR: 0.000120
  ★ New best model saved! Accuracy: 72.30%


Epoch 157/200: 100%|██████████| 390/390 [01:14<00:00,  5.23it/s, loss=0.709]


Epoch 157/200 | Train Loss: 1.0744 | Val Loss: 1.7305 | Val Acc: 71.78% | LR: 0.000115


Epoch 158/200: 100%|██████████| 390/390 [01:13<00:00,  5.27it/s, loss=0.716]


Epoch 158/200 | Train Loss: 1.0482 | Val Loss: 1.3701 | Val Acc: 72.53% | LR: 0.000110
  ★ New best model saved! Accuracy: 72.53%


Epoch 159/200: 100%|██████████| 390/390 [01:13<00:00,  5.30it/s, loss=0.655]


Epoch 159/200 | Train Loss: 1.0916 | Val Loss: 1.6118 | Val Acc: 72.08% | LR: 0.000105


Epoch 160/200: 100%|██████████| 390/390 [01:13<00:00,  5.30it/s, loss=3.050]


Epoch 160/200 | Train Loss: 1.0678 | Val Loss: 1.3506 | Val Acc: 72.61% | LR: 0.000100
  ★ New best model saved! Accuracy: 72.61%
  Checkpoint saved: outputs\checkpoints\student_v3_1_dkd_beta2_epoch160.pth
  Cleaned up: student_v3_1_dkd_beta2_epoch160.pth


Epoch 161/200: 100%|██████████| 390/390 [01:14<00:00,  5.26it/s, loss=0.414]


Epoch 161/200 | Train Loss: 1.1204 | Val Loss: 1.3496 | Val Acc: 72.79% | LR: 0.000095
  ★ New best model saved! Accuracy: 72.79%


Epoch 162/200: 100%|██████████| 390/390 [01:13<00:00,  5.27it/s, loss=2.200]


Epoch 162/200 | Train Loss: 1.2001 | Val Loss: 1.3384 | Val Acc: 73.23% | LR: 0.000091
  ★ New best model saved! Accuracy: 73.23%


Epoch 163/200: 100%|██████████| 390/390 [01:13<00:00,  5.29it/s, loss=0.513]


Epoch 163/200 | Train Loss: 1.0926 | Val Loss: 1.2753 | Val Acc: 73.22% | LR: 0.000086


Epoch 164/200: 100%|██████████| 390/390 [01:13<00:00,  5.28it/s, loss=2.497]


Epoch 164/200 | Train Loss: 1.0940 | Val Loss: 1.3488 | Val Acc: 73.19% | LR: 0.000082


Epoch 165/200: 100%|██████████| 390/390 [01:13<00:00,  5.30it/s, loss=3.518]


Epoch 165/200 | Train Loss: 1.1133 | Val Loss: 2.0715 | Val Acc: 72.13% | LR: 0.000077


Epoch 166/200: 100%|██████████| 390/390 [01:13<00:00,  5.29it/s, loss=1.014]


Epoch 166/200 | Train Loss: 1.1191 | Val Loss: 1.3418 | Val Acc: 73.18% | LR: 0.000073


Epoch 167/200: 100%|██████████| 390/390 [01:14<00:00,  5.22it/s, loss=0.412]


Epoch 167/200 | Train Loss: 1.0902 | Val Loss: 1.2552 | Val Acc: 73.48% | LR: 0.000069
  ★ New best model saved! Accuracy: 73.48%


Epoch 168/200: 100%|██████████| 390/390 [01:13<00:00,  5.29it/s, loss=0.473]


Epoch 168/200 | Train Loss: 1.0506 | Val Loss: 1.3756 | Val Acc: 73.02% | LR: 0.000065


Epoch 169/200: 100%|██████████| 390/390 [01:13<00:00,  5.28it/s, loss=0.538]


Epoch 169/200 | Train Loss: 1.1058 | Val Loss: 1.4782 | Val Acc: 72.83% | LR: 0.000061


Epoch 170/200: 100%|██████████| 390/390 [01:14<00:00,  5.25it/s, loss=0.663]


Epoch 170/200 | Train Loss: 1.0775 | Val Loss: 1.4037 | Val Acc: 73.40% | LR: 0.000057


Epoch 171/200: 100%|██████████| 390/390 [01:14<00:00,  5.26it/s, loss=0.586]


Epoch 171/200 | Train Loss: 1.0693 | Val Loss: 1.3976 | Val Acc: 72.86% | LR: 0.000054


Epoch 172/200: 100%|██████████| 390/390 [01:13<00:00,  5.28it/s, loss=0.553]


Epoch 172/200 | Train Loss: 1.0227 | Val Loss: 1.4470 | Val Acc: 72.96% | LR: 0.000050


Epoch 173/200: 100%|██████████| 390/390 [01:13<00:00,  5.31it/s, loss=0.467]


Epoch 173/200 | Train Loss: 1.0382 | Val Loss: 1.2465 | Val Acc: 73.60% | LR: 0.000047
  ★ New best model saved! Accuracy: 73.60%


Epoch 174/200: 100%|██████████| 390/390 [01:14<00:00,  5.26it/s, loss=0.409]


Epoch 174/200 | Train Loss: 0.9822 | Val Loss: 1.3117 | Val Acc: 73.48% | LR: 0.000043


Epoch 175/200: 100%|██████████| 390/390 [01:13<00:00,  5.28it/s, loss=0.703]


Epoch 175/200 | Train Loss: 1.0661 | Val Loss: 1.4264 | Val Acc: 73.22% | LR: 0.000040


Epoch 176/200: 100%|██████████| 390/390 [01:14<00:00,  5.24it/s, loss=0.701]


Epoch 176/200 | Train Loss: 1.0489 | Val Loss: 1.4557 | Val Acc: 73.60% | LR: 0.000037


Epoch 177/200: 100%|██████████| 390/390 [01:13<00:00,  5.30it/s, loss=0.691]


Epoch 177/200 | Train Loss: 1.0717 | Val Loss: 1.4569 | Val Acc: 73.08% | LR: 0.000034


Epoch 178/200: 100%|██████████| 390/390 [01:14<00:00,  5.27it/s, loss=0.526]


Epoch 178/200 | Train Loss: 1.0109 | Val Loss: 1.4018 | Val Acc: 73.78% | LR: 0.000031
  ★ New best model saved! Accuracy: 73.78%


Epoch 179/200: 100%|██████████| 390/390 [01:13<00:00,  5.30it/s, loss=0.700]


Epoch 179/200 | Train Loss: 1.1420 | Val Loss: 1.2828 | Val Acc: 73.73% | LR: 0.000028


Epoch 180/200: 100%|██████████| 390/390 [01:14<00:00,  5.27it/s, loss=0.399]


Epoch 180/200 | Train Loss: 1.0247 | Val Loss: 1.3125 | Val Acc: 73.61% | LR: 0.000026
  Checkpoint saved: outputs\checkpoints\student_v3_1_dkd_beta2_epoch180.pth
  Cleaned up: student_v3_1_dkd_beta2_epoch180.pth


Epoch 181/200: 100%|██████████| 390/390 [01:14<00:00,  5.26it/s, loss=0.705]


Epoch 181/200 | Train Loss: 1.0289 | Val Loss: 1.4978 | Val Acc: 72.81% | LR: 0.000023


Epoch 182/200: 100%|██████████| 390/390 [01:13<00:00,  5.28it/s, loss=0.692]


Epoch 182/200 | Train Loss: 1.0376 | Val Loss: 1.5420 | Val Acc: 73.04% | LR: 0.000021


Epoch 183/200: 100%|██████████| 390/390 [01:14<00:00,  5.26it/s, loss=0.399]


Epoch 183/200 | Train Loss: 1.0719 | Val Loss: 1.4947 | Val Acc: 73.32% | LR: 0.000019


Epoch 184/200: 100%|██████████| 390/390 [01:14<00:00,  5.24it/s, loss=0.550]


Epoch 184/200 | Train Loss: 1.0535 | Val Loss: 1.3238 | Val Acc: 73.50% | LR: 0.000017


Epoch 185/200: 100%|██████████| 390/390 [01:14<00:00,  5.24it/s, loss=3.417]


Epoch 185/200 | Train Loss: 1.0362 | Val Loss: 1.3997 | Val Acc: 73.41% | LR: 0.000015


Epoch 186/200: 100%|██████████| 390/390 [01:13<00:00,  5.28it/s, loss=0.660]


Epoch 186/200 | Train Loss: 1.0540 | Val Loss: 1.2796 | Val Acc: 73.98% | LR: 0.000013
  ★ New best model saved! Accuracy: 73.98%


Epoch 187/200: 100%|██████████| 390/390 [01:14<00:00,  5.26it/s, loss=1.757]


Epoch 187/200 | Train Loss: 1.0241 | Val Loss: 1.3191 | Val Acc: 73.71% | LR: 0.000011


Epoch 188/200: 100%|██████████| 390/390 [01:14<00:00,  5.24it/s, loss=1.793]


Epoch 188/200 | Train Loss: 0.9912 | Val Loss: 1.3611 | Val Acc: 73.43% | LR: 0.000009


Epoch 189/200: 100%|██████████| 390/390 [01:13<00:00,  5.27it/s, loss=2.282]


Epoch 189/200 | Train Loss: 1.0432 | Val Loss: 1.5905 | Val Acc: 73.32% | LR: 0.000008


Epoch 190/200: 100%|██████████| 390/390 [01:14<00:00,  5.24it/s, loss=0.438]


Epoch 190/200 | Train Loss: 0.9985 | Val Loss: 1.3640 | Val Acc: 73.43% | LR: 0.000006


Epoch 191/200: 100%|██████████| 390/390 [01:14<00:00,  5.24it/s, loss=0.639]


Epoch 191/200 | Train Loss: 1.0797 | Val Loss: 1.3120 | Val Acc: 73.69% | LR: 0.000005


Epoch 192/200: 100%|██████████| 390/390 [01:14<00:00,  5.26it/s, loss=0.420]


Epoch 192/200 | Train Loss: 0.9801 | Val Loss: 1.2683 | Val Acc: 73.82% | LR: 0.000004


Epoch 193/200: 100%|██████████| 390/390 [01:13<00:00,  5.27it/s, loss=0.595]


Epoch 193/200 | Train Loss: 1.0271 | Val Loss: 1.4089 | Val Acc: 73.31% | LR: 0.000003


Epoch 194/200: 100%|██████████| 390/390 [01:14<00:00,  5.20it/s, loss=0.707]


Epoch 194/200 | Train Loss: 1.0315 | Val Loss: 1.4823 | Val Acc: 73.40% | LR: 0.000002


Epoch 195/200: 100%|██████████| 390/390 [01:13<00:00,  5.29it/s, loss=3.483]


Epoch 195/200 | Train Loss: 1.0199 | Val Loss: 1.2919 | Val Acc: 73.80% | LR: 0.000002


Epoch 196/200: 100%|██████████| 390/390 [01:13<00:00,  5.28it/s, loss=0.497]


Epoch 196/200 | Train Loss: 1.0648 | Val Loss: 1.4802 | Val Acc: 73.57% | LR: 0.000001


Epoch 197/200: 100%|██████████| 390/390 [01:14<00:00,  5.25it/s, loss=2.159]


Epoch 197/200 | Train Loss: 1.0852 | Val Loss: 1.3730 | Val Acc: 73.55% | LR: 0.000001


Epoch 198/200: 100%|██████████| 390/390 [01:14<00:00,  5.26it/s, loss=0.570]


Epoch 198/200 | Train Loss: 1.0429 | Val Loss: 1.4996 | Val Acc: 73.04% | LR: 0.000000


Epoch 199/200: 100%|██████████| 390/390 [01:13<00:00,  5.27it/s, loss=0.502]


Epoch 199/200 | Train Loss: 1.0405 | Val Loss: 1.2395 | Val Acc: 73.95% | LR: 0.000000


Epoch 200/200: 100%|██████████| 390/390 [01:14<00:00,  5.26it/s, loss=2.992]


Epoch 200/200 | Train Loss: 1.0202 | Val Loss: 1.3270 | Val Acc: 73.83% | LR: 0.000000
  Checkpoint saved: outputs\checkpoints\student_v3_1_dkd_beta2_epoch200.pth
  Cleaned up: student_v3_1_dkd_beta2_epoch200.pth

Training complete. Best accuracy: 73.98%

Student v3.1 (DKD β=2.0) Final Accuracy: 73.98%


In [9]:
# Cell 9: Results Summary (Version 3.1)
print("\n" + "="*80)
print("VERSION 3.1 RESULTS SUMMARY - Tuned DKD (Beta=2.0)")
print("="*80)

# Evaluate all available models for comparison
criterion = nn.CrossEntropyLoss()

print("\n┌─────────────────────────────────────────────────────────────────────┐")
print("│                    MODEL COMPARISON TABLE                           │")
print("├──────────────────────────────┬──────────────┬───────────────────────┤")
print("│ Model                        │ Accuracy (%) │ Notes                 │")
print("├──────────────────────────────┼──────────────┼───────────────────────┤")

# Teacher v3
try:
    print(f"│ Teacher v3 (EfficientNet-L)  │ {teacher_accuracy:12.2f} │ Extended training     │")
except NameError:
    print(f"│ Teacher v3 (EfficientNet-L)  │ {'N/A':>12} │                       │")

# Student v3.1 (DKD with tuned beta)
try:
    print(f"│ Student v3.1 DKD β=2.0       │ {distilled_accuracy:12.2f} │ Tuned NCKD weight     │")
except NameError:
    print(f"│ Student v3.1 DKD β=2.0       │ {'N/A':>12} │                       │")

print("├──────────────────────────────┴──────────────┴───────────────────────┤")
print("│                    PREVIOUS VERSIONS (for comparison)              │")
print("├──────────────────────────────┬──────────────┬───────────────────────┤")
print("│ Student v1 (Baseline)        │        72.34 │ Standard KD           │")
print("│ Student v2 (Enhanced)        │        74.20 │ + AutoAug + Warmup    │")
print("│ Student v3 (DKD β=8.0)       │        72.69 │ Over-regularized      │")
print("└──────────────────────────────┴──────────────┴───────────────────────┘")

print("\n" + "="*80)
print("VERSION 3.1 ANALYSIS:")
try:
    gap = distilled_accuracy - teacher_accuracy
    retention = (distilled_accuracy / teacher_accuracy) * 100
    print(f"  Teacher v3 Accuracy:       {teacher_accuracy:.2f}%")
    print(f"  Student v3.1 Accuracy:     {distilled_accuracy:.2f}%")
    print(f"  Gap from Teacher:          {gap:+.2f}%")
    print(f"  Teacher Retention:         {retention:.1f}%")
    
    # Compare with v3
    v3_accuracy = 72.69
    improvement = distilled_accuracy - v3_accuracy
    print(f"\n  Improvement over v3 (β=8.0): {improvement:+.2f}%")
except NameError:
    print("  Run training cells first to see results.")
print("="*80)

print("\nVersion 3.1 Key Change:")
print(f"  ✓ DKD Beta REDUCED: 8.0 → {DKD_BETA}")
print(f"  ✓ Hypothesis: Less aggressive NCKD reduces over-regularization")
print(f"  ✓ Expected: Better balance with Mixup/CutMix augmentation")

print(f"\nModel Files:")
print(f"  Teacher v3:   {MODEL_DIR / TEACHER_NAME}.pth")
print(f"  Student v3.1: {MODEL_DIR / STUDENT_NAME}.pth")
print(f"\nCheckpoints: {CHECKPOINT_DIR.absolute()}")


VERSION 3.1 RESULTS SUMMARY - Tuned DKD (Beta=2.0)

┌─────────────────────────────────────────────────────────────────────┐
│                    MODEL COMPARISON TABLE                           │
├──────────────────────────────┬──────────────┬───────────────────────┤
│ Model                        │ Accuracy (%) │ Notes                 │
├──────────────────────────────┼──────────────┼───────────────────────┤
│ Teacher v3 (EfficientNet-L)  │        76.02 │ Extended training     │
│ Student v3.1 DKD β=2.0       │        73.98 │ Tuned NCKD weight     │
├──────────────────────────────┴──────────────┴───────────────────────┤
│                    PREVIOUS VERSIONS (for comparison)              │
├──────────────────────────────┬──────────────┬───────────────────────┤
│ Student v1 (Baseline)        │        72.34 │ Standard KD           │
│ Student v2 (Enhanced)        │        74.20 │ + AutoAug + Warmup    │
│ Student v3 (DKD β=8.0)       │        72.69 │ Over-regularized      │
└───────────