# Training ImageNet / ImageNette Model (Local Mac)
This notebook trains a model on **ImageNet or ImageNette** locally on a Mac, using the **modular codebase**.

**Supported Datasets:**
- **ImageNet-1K**: Full 1000-class dataset (~1.28M train, 50K val)
- **ImageNette**: 10-class ImageNet subset (fast for quick trials)
- **Custom ImageNet subsets**: Any ImageFolder-compatible dataset

**Training Command Example (ImageNette):**
```bash
python train.py --model resnet50 --dataset imagenet --data-dir ./imagenette2-160 \
    --epochs 3 --batch-size 128 --scheduler onecycle
```

**Required Dataset Structure:**
```
your_dataset_dir/
    train/
        n01440764/  # class folders
            image1.JPEG
            ...
        n01443537/
            ...
    val/
        n01440764/
            ...
```
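
Before starting a long run, it can help to confirm that the directory matches this layout. The following is a minimal sketch using only the standard library; `summarize_split` and `check_dataset` are hypothetical helper names and are not part of the codebase. It counts every regular file in a class folder without checking that it is a valid image.

```python
import os

def summarize_split(data_dir, split):
    """Return {class_name: file_count} for data_dir/split (e.g. 'train' or 'val')."""
    split_dir = os.path.join(data_dir, split)
    summary = {}
    for name in sorted(os.listdir(split_dir)):
        class_dir = os.path.join(split_dir, name)
        if os.path.isdir(class_dir):
            # Count regular files only; nested folders are ignored
            summary[name] = sum(
                1 for f in os.listdir(class_dir)
                if os.path.isfile(os.path.join(class_dir, f))
            )
    return summary

def check_dataset(data_dir):
    """Summarize class and image counts, and flag train classes absent from val."""
    train = summarize_split(data_dir, "train")
    val = summarize_split(data_dir, "val")
    return {
        "num_classes": len(train),
        "train_images": sum(train.values()),
        "val_images": sum(val.values()),
        "missing_in_val": sorted(set(train) - set(val)),
    }
```

Calling `check_dataset("./imagenette2-160")` should return a class count that agrees with the "Number of classes" line printed by `train.py`, assuming the loader also treats each subfolder of `train/` as one class.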

**Modular Structure:**
- Datasets in `data_loaders/` - CIFAR-100, ImageNet, easy to extend
- Models in `models/` - ResNet50, WideResNet, clean separation
- Training in `training/` - Optimizer, scheduler, LR finder  
- Utils in `utils/` - Checkpointing, metrics, HuggingFace

## Check Python Environment

In [43]:
import sys
import platform
import os

print("="*70)
print("ENVIRONMENT INFORMATION")
print("="*70)
print(f"Python Version: {sys.version}")
print(f"Platform: {platform.platform()}")
print(f"Processor: {platform.processor()}")
print(f"Working Directory: {os.getcwd()}")
print("="*70)

ENVIRONMENT INFORMATION
Python Version: 3.12.3 (main, Oct  7 2025, 19:27:29) [Clang 17.0.0 (clang-1700.0.13.5)]
Platform: macOS-15.5-arm64-arm-64bit
Processor: arm
Working Directory: /Users/pandurang/projects/pandurang/resnet50-imagenet-1k


## Check GPU/MPS Availability

In [44]:
try:
    import torch
    
    print("\n" + "="*70)
    print("PYTORCH & GPU DETECTION")
    print("="*70)
    print(f"PyTorch Version: {torch.__version__}")
    
    # Check for CUDA
    if torch.cuda.is_available():
        print(f"✓ CUDA is available")
        print(f"  GPU: {torch.cuda.get_device_name(0)}")
        print(f"  CUDA Version: {torch.version.cuda}")
        device = torch.device('cuda')
    # Check for MPS (Apple Silicon)
    elif hasattr(torch.backends, 'mps') and torch.backends.mps.is_available():
        print(f"✓ Apple MPS (Metal Performance Shaders) is available")
        print(f"  This Mac has Apple Silicon GPU acceleration")
        device = torch.device('mps')
    else:
        print(f"⚠ No GPU detected. Training will use CPU")
        device = torch.device('cpu')
    
    print(f"✓ Using device: {device}")
    print("="*70)
    
except ImportError:
    print("⚠ PyTorch not installed. Will install dependencies in next step.")


PYTORCH & GPU DETECTION
PyTorch Version: 2.9.0
✓ Apple MPS (Metal Performance Shaders) is available
  This Mac has Apple Silicon GPU acceleration
✓ Using device: mps


## Install Dependencies

In [45]:
# Install dependencies from requirements.txt
!pip install -r requirements.txt


[notice] A new release of pip is available: 24.0 -> 25.2
[notice] To update, run: pip install --upgrade pip


## Import and Verify Libraries

In [46]:
import torch
import torchvision
import torchsummary
import torchinfo
import tqdm
import matplotlib
import numpy
import plotille
import albumentations

print("\n" + "="*70)
print("LIBRARY VERSIONS")
print("="*70)
print(f"PyTorch: {torch.__version__}")
print(f"TorchVision: {torchvision.__version__}")
print(f"NumPy: {numpy.__version__}")
print(f"Matplotlib: {matplotlib.__version__}")
print(f"Albumentations: {albumentations.__version__}")
print(f"TQDM: {tqdm.__version__}")
print("="*70)
print("✓ All dependencies successfully imported")
print("="*70)


LIBRARY VERSIONS
PyTorch: 2.9.0
TorchVision: 0.24.0
NumPy: 2.2.6
Matplotlib: 3.10.7
Albumentations: 2.0.8
TQDM: 4.67.1
✓ All dependencies successfully imported


## Verify Training Files

In [47]:
import os

print("\n" + "="*70)
print("VERIFYING TRAINING FILES AND MODULAR STRUCTURE")
print("="*70)

# Check required files
required_files = [
    'train.py',
    'config.json',
    'requirements.txt'
]

print("\nRequired Files:")
files_ok = True
for file in required_files:
    exists = os.path.exists(file)
    status = "✓" if exists else "✗"
    print(f"{status} {file}")
    if not exists:
        files_ok = False

# Check modular directories
required_dirs = ['data_loaders', 'models', 'training', 'utils']
print("\nModular Directories:")
dirs_ok = True
for directory in required_dirs:  # avoid shadowing the built-in dir()
    exists = os.path.isdir(directory)
    status = "✓" if exists else "✗"
    print(f"{status} {directory}/")
    if not exists:
        dirs_ok = False

print("="*70)
if files_ok and dirs_ok:
    print("✓ All required files and modular structure verified!")
else:
    print("⚠ Some files or directories are missing. Please check your directory.")
print("="*70)


VERIFYING TRAINING FILES AND MODULAR STRUCTURE

Required Files:
✓ train.py
✓ config.json
✓ requirements.txt

Modular Directories:
✓ data_loaders/
✓ models/
✓ training/
✓ utils/
✓ All required files and modular structure verified!


## Training Configuration for ImageNet / ImageNette

The training will use the following configuration with the **modular codebase**:

### Model Options:
- **resnet50** - ResNet-50 (25.6M parameters with a 1000-class head; fewer with a smaller head, from `models/resnet50.py`)
- **wideresnet28-10** - WideResNet-28-10 (36.5M parameters, from `models/wideresnet.py`)

### Dataset: ImageNet / ImageNette
- **ImageNette-160**: 10 classes, images with the shorter side resized to 160 px (resized to 224x224 for training)
- **ImageNet-1K**: 1000 classes, variable size images (resized to 224x224)
- Number of classes automatically detected from dataset directory

### Training Parameters:
- **Epochs**: 3-10 for quick trials (ImageNette), 90-100 for full training
- **Batch Size**: 128-256 (adjust based on GPU memory)
- **Scheduler**: OneCycle Learning Rate Policy or Cosine
- **LR Finder**: Optional - automatically finds optimal learning rate
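
To make the OneCycle option concrete, here is a simplified pure-Python sketch of a two-phase, cosine-annealed one-cycle schedule. The defaults (`pct_start=0.3`, `div_factor=25`, `final_div_factor=1e4`) mirror those of `torch.optim.lr_scheduler.OneCycleLR`; whether `train.py` uses these values is an assumption, and `one_cycle_lr` is a hypothetical name used only for illustration.

```python
import math

def one_cycle_lr(step, total_steps, max_lr, pct_start=0.3,
                 div_factor=25.0, final_div_factor=1e4):
    """Learning rate at a 0-based `step` of a two-phase cosine one-cycle schedule."""
    initial_lr = max_lr / div_factor          # starting learning rate
    min_lr = initial_lr / final_div_factor    # learning rate at the last step
    warmup_end = pct_start * total_steps - 1  # step at which max_lr is reached
    last_step = total_steps - 1

    def cos_anneal(start, end, pct):
        # Cosine interpolation from `start` (pct=0) to `end` (pct=1)
        return end + (start - end) / 2.0 * (1.0 + math.cos(math.pi * pct))

    if step <= warmup_end:
        pct = step / warmup_end if warmup_end > 0 else 1.0
        return cos_anneal(initial_lr, max_lr, pct)
    pct = (step - warmup_end) / (last_step - warmup_end)
    return cos_anneal(max_lr, min_lr, pct)

# Example: 100 optimizer steps with a peak learning rate of 0.1
schedule = [one_cycle_lr(s, 100, 0.1) for s in range(100)]
```

With these settings the rate rises from `max_lr / 25` to `max_lr` over roughly the first 30% of steps, then decays toward a value far below the starting rate. In a one-cycle run the total step count is usually epochs times batches per epoch, so changing `--epochs` or `--batch-size` also changes the shape of the schedule.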

### ImageNet-Specific Features:
- **Transforms**: RandomResizedCrop(224), ColorJitter, HorizontalFlip
- **Validation**: Resize(256) → CenterCrop(224)
- **Mixed Precision**: Enabled by default where supported (the MPS run in this notebook reports `Mixed Precision: False`)
- **MixUp**: Enabled (alpha=0.2)
- **Label Smoothing**: 0.1
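
As an illustration of the last two items, the sketch below shows label smoothing and MixUp on plain Python lists. It is not the code used by `train.py`: the function names are hypothetical, and a real implementation would typically mix whole batches of tensors. The values 0.1 and 0.2 are the smoothing factor and MixUp alpha listed above.

```python
import random

def smooth_one_hot(label, num_classes, smoothing=0.1):
    """One-hot target with `smoothing` probability mass spread uniformly over all classes."""
    off_value = smoothing / num_classes
    target = [off_value] * num_classes
    target[label] += 1.0 - smoothing
    return target

def mixup_pair(x_a, x_b, y_a, y_b, alpha=0.2, rng=random):
    """Blend two samples and their soft targets with lam drawn from Beta(alpha, alpha)."""
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x, y, lam

# Example: mix two toy "images" from classes 0 and 1 of a 10-class problem
x, y, lam = mixup_pair([1.0, 0.0], [0.0, 1.0],
                       smooth_one_hot(0, 10), smooth_one_hot(1, 10))
```

With a small alpha such as 0.2, the Beta distribution concentrates near 0 and 1, so most mixed samples stay close to one of the two originals. The mixed target still sums to 1, so it can be used directly with a soft-target cross-entropy loss.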

### Command Line Parameter:
**IMPORTANT**: You must specify `--data-dir` pointing to your ImageNet/ImageNette directory!

## Start Training on ImageNet / ImageNette

**Before running:** Ensure your dataset is downloaded and the path is correct!

**Training will:**
1. Load ImageNet/ImageNette from the specified `--data-dir`
2. Automatically detect number of classes from directory structure
3. Optionally run LR Finder (if `--lr-finder` is specified)
4. Train with OneCycle scheduler and advanced augmentations
5. Save checkpoints to `checkpoint_N/` folder

**Example Commands:**

```bash
# ImageNette quick trial (3 epochs)
python train.py --model resnet50 --dataset imagenet --data-dir ./imagenette2-160 \
    --epochs 3 --batch-size 128

# ImageNette with LR finder
python train.py --model resnet50 --dataset imagenet --data-dir ./imagenette2-160 \
    --epochs 10 --batch-size 128 --scheduler onecycle --lr-finder

# Full ImageNet training (requires powerful GPU)
python train.py --model resnet50 --dataset imagenet --data-dir /path/to/imagenet \
    --epochs 90 --batch-size 256 --scheduler onecycle
```

**Note:** Update the `--data-dir` parameter below with your actual dataset path!

In [48]:
# Run training on ImageNet/ImageNette
# IMPORTANT: Update --data-dir with your actual dataset path!

# Example: ImageNette training (3 epochs, quick trial)
!python train.py --model resnet50-pytorch --dataset imagenet --data-dir ./imagenette2-160 \
    --epochs 3 --batch-size 64 --scheduler onecycle

# Alternative commands (uncomment to use):

# With LR Finder:
# !python train.py --model resnet50 --dataset imagenet --data-dir ./imagenette2-160 \
#     --epochs 10 --batch-size 128 --scheduler onecycle --lr-finder

# Full ImageNet (adjust path):
# !python train.py --model resnet50 --dataset imagenet --data-dir /path/to/imagenet \
#     --epochs 90 --batch-size 256 --scheduler onecycle

✓ Using Apple Silicon GPU (MPS)
   Recommended: 0-2 workers for Apple Silicon GPU
   Using num_workers=2 for better performance

Loading ImageNet dataset from: ./imagenette2-160
ImageNet dataset loaded:
  Training samples: 9,469
  Validation samples: 3,925
  Number of classes: 10
Detected 10 classes in dataset

Training resnet50-pytorch for 3 epochs on ImageNet (10 classes)

Model Architecture Summary
Device: mps
Model: resnet50-pytorch
Number of classes: 10
Mixed Precision: False
MixUp: True (alpha=0.2)
Label Smoothing: 0.1

Total Parameters: 23,528,522
Trainable Parameters: 23,528,522
Non-trainable Parameters: 0

Model Summary:

⚠ torchsummary.summary failed: slow_conv2d_forward_mps: input(device='cpu') and weight(device=mps:0')  must be on the same device
Falling back to torchinfo...

Layer (type:depth-idx)                   Output Shape              Param #
ResNet                                   [1, 10]                   --
├─Conv2d: 1-1                            [1, 64, 112, 11

## Training Complete - View Results

After training completes, you can view the results below.

In [49]:
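# List checkpoint directories
import glob
import json
import os
import re

def checkpoint_index(path):
    """Numeric suffix of 'checkpoint_N' (-1 if absent), so checkpoint_10 sorts after checkpoint_9."""
    match = re.search(r'(\d+)$', path)
    return int(match.group(1)) if match else -1

# Newest (highest-numbered) checkpoint first; a plain string sort would rank checkpoint_9 above checkpoint_10
checkpoint_dirs = sorted(glob.glob('checkpoint_*'), key=checkpoint_index, reverse=True)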

if checkpoint_dirs:
    latest_checkpoint = checkpoint_dirs[0]
    print(f"\n{'='*70}")
    print(f"LATEST CHECKPOINT: {latest_checkpoint}")
    print(f"{'='*70}\n")
    
    # Load and display metrics
    metrics_file = os.path.join(latest_checkpoint, 'metrics.json')
    if os.path.exists(metrics_file):
        with open(metrics_file, 'r') as f:
            metrics = json.load(f)
        
        print(f"Best Test Accuracy: {metrics['best_test_accuracy']:.2f}%")
        print(f"Best Epoch: {metrics['best_epoch']}")
        print(f"Total Epochs Trained: {len(metrics['epochs'])}")
        print(f"\nFinal Metrics:")
        print(f"  - Train Accuracy: {metrics['train_accuracies'][-1]:.2f}%")
        print(f"  - Test Accuracy: {metrics['test_accuracies'][-1]:.2f}%")
        print(f"  - Train Loss: {metrics['train_losses'][-1]:.4f}")
        print(f"  - Test Loss: {metrics['test_losses'][-1]:.4f}")
    
    # List saved files
    print(f"\nSaved Files in {latest_checkpoint}:")
    for file in sorted(os.listdir(latest_checkpoint)):
        file_path = os.path.join(latest_checkpoint, file)
        if os.path.isfile(file_path):
            file_size = os.path.getsize(file_path) / (1024 * 1024)  # MB
            print(f"  - {file} ({file_size:.2f} MB)")
    
    print(f"\n{'='*70}")
else:
    print("No checkpoint directories found. Training may not have completed successfully.")

No checkpoint directories found. Training may not have completed successfully.


## View Training Curves

In [50]:
from IPython.display import Image, display

if checkpoint_dirs:
    latest_checkpoint = checkpoint_dirs[0]
    
    # Display training curves
    curves_path = os.path.join(latest_checkpoint, 'training_curves.png')
    if os.path.exists(curves_path):
        print("Training Curves:")
        display(Image(filename=curves_path))
    else:
        print("Training curves not found.")
    
    # Display LR Finder plot
    lr_finder_path = os.path.join(latest_checkpoint, 'lr_finder_plot.png')
    if os.path.exists(lr_finder_path):
        print("\nLR Finder Plot:")
        display(Image(filename=lr_finder_path))
    else:
        print("LR Finder plot not found.")

## Load and Test Best Model

You can load the best saved model and use it for inference or further testing.

In [51]:
import torch

if checkpoint_dirs:
    latest_checkpoint = checkpoint_dirs[0]
    best_model_path = os.path.join(latest_checkpoint, 'best_model.pth')
    
    if os.path.exists(best_model_path):
        # PyTorch 2.6+ defaults to weights_only=True; weights_only=False restores full
        # unpickling, so only load checkpoints you trust
        checkpoint = torch.load(best_model_path, map_location='cpu', weights_only=False)
        
        print(f"\n{'='*70}")
        print("BEST MODEL CHECKPOINT INFORMATION")
        print(f"{'='*70}")
        print(f"Epoch: {checkpoint['epoch']}")
        print(f"Train Accuracy: {checkpoint['train_accuracy']:.2f}%")
        print(f"Test Accuracy: {checkpoint['test_accuracy']:.2f}%")
        print(f"Train Loss: {checkpoint['train_loss']:.4f}")
        print(f"Test Loss: {checkpoint['test_loss']:.4f}")
        print(f"Timestamp: {checkpoint['timestamp']}")
        
        print(f"\nModel Configuration:")
        for key, value in checkpoint['config'].items():
            print(f"  - {key}: {value}")
        
        print(f"{'='*70}\n")
        
        # Load model using the new modular structure
        from models import get_model
        
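        model_name = checkpoint['config'].get('model', 'resnet50')
        # Use the class count saved with the checkpoint instead of a hard-coded value
        # (10 for ImageNette, 1000 for ImageNet-1K)
        num_classes = checkpoint['config'].get('num_classes', 10)
        model = get_model(model_name, num_classes=num_classes)
        model.load_state_dict(checkpoint['model_state_dict'])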
        model.eval()
        print(f"✓ Model '{model_name}' loaded successfully from modular structure")
        print("✓ Model ready for inference")
    else:
        print("⚠ Best model checkpoint not found.")

## Summary

Training on **ImageNet/ImageNette** is complete using the **modular codebase**!

### 📁 Checkpoint Files:
- **Best Model**: `checkpoint_N/best_model.pth` - Model with best test accuracy
- **Training Curves**: `checkpoint_N/training_curves.png` - Metrics visualization
- **LR Finder Plot**: `checkpoint_N/lr_finder_plot.png` - LR range test (if enabled)
- **Metrics**: `checkpoint_N/metrics.json` - Complete training history
- **Config**: `checkpoint_N/config.json` - Training configuration
- **Model Card**: `checkpoint_N/README.md` - Detailed documentation

### 🏗️ Modular Structure:
- **Data Loaders** (`data_loaders/`) - CIFAR-100, ImageNet, easy to add more
- **Models** (`models/`) - ResNet50, WideResNet, clean separation
- **Training** (`training/`) - Reusable optimizer, scheduler, LR finder
- **Utils** (`utils/`) - Checkpointing, metrics, HuggingFace upload

### 🎯 Model Usage:
```python
import torch
from models import get_model

# Load checkpoint
checkpoint = torch.load('checkpoint_N/best_model.pth', 
                       map_location='cpu', weights_only=False)

# Rebuild the architecture from the model name and class count saved in the checkpoint config
model_name = checkpoint['config'].get('model', 'resnet50')
num_classes = checkpoint['config'].get('num_classes', 10)  # 10 for ImageNette, 1000 for ImageNet-1K
model = get_model(model_name, num_classes=num_classes)
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()
```

### 📊 Supported Datasets:
- **CIFAR-100**: 100 classes, 32x32 images
- **ImageNet-1K**: 1000 classes, 224x224 (resized)
- **ImageNette**: 10 classes, ImageNet subset
- **Custom**: Any ImageFolder-compatible dataset

### 🆕 Available Models:
- `resnet50` - ResNet-50 (25.6M parameters)
- `wideresnet28-10` - WideResNet-28-10 (36.5M parameters)

---

**Modular design makes experimentation easy!**