paper link: https://arxiv.org/pdf/1611.05431

# Aggregated Residual Transformations for Deep Neural Networks

Our method indicates that cardinality (the size of the set of transformations) is a concrete, measurable dimension that is of central importance, in addition to the dimensions of width and depth. Experiments demonstrate that increasing cardinality is a more effective way of gaining accuracy than going deeper or wider, especially when depth and width start to give diminishing returns for existing models.

The residual blocks share the same topology and follow two simple rules:

(i) if they produce spatial maps of the same size, the blocks share the same hyper-parameters (width and filter sizes), and

(ii) each time the spatial map is downsampled by a factor of 2, the width of the blocks is multiplied by a factor of 2.
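
The two rules can be sketched in a few lines of Python. The helper name `stage_plan` and its arguments are illustrative assumptions, not something defined in the paper; the example values (32×32 input, width 64, three stages) match the CIFAR setup used later in this notebook.

```python
def stage_plan(input_size, base_width, num_stages):
    """Return (spatial_size, width) for each stage under the two rules.

    Hypothetical helper, for illustration only.
    """
    plan = []
    size, width = input_size, base_width
    for stage in range(num_stages):
        if stage > 0:
            size //= 2   # rule (ii): the spatial map is downsampled by 2 ...
            width *= 2   # ... so the block width is multiplied by 2
        # rule (i): every block inside this stage shares the same width
        plan.append((size, width))
    return plan

print(stage_plan(input_size=32, base_width=64, num_stages=3))
# [(32, 64), (16, 128), (8, 256)]
```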

Each transformation can be an arbitrary function that projects the input into a (typically lower-dimensional) embedding and then transforms it. All transformations share the same topology (they are homogeneous), and their outputs are aggregated by summation.
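
In the paper's notation, with $C$ the cardinality and $\mathcal{T}_i$ the $i$-th transformation, the aggregation and its residual form can be written as:

```latex
\mathcal{F}(\mathbf{x}) = \sum_{i=1}^{C} \mathcal{T}_i(\mathbf{x}),
\qquad
\mathbf{y} = \mathbf{x} + \sum_{i=1}^{C} \mathcal{T}_i(\mathbf{x})
```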

More modularized design 
Reduced hyper-parameters
Cardinality: the number of independent transformation paths to be aggregated
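
One way to see why many parallel paths can be cheap is to count the weights of a 3×3 convolution once it is split into groups, which is how the code later in this notebook realizes cardinality (`groups=cardinality`). This is a minimal sketch: `conv3x3_weights` is a hypothetical helper name and the channel counts are example values.

```python
def conv3x3_weights(in_channels, out_channels, groups=1):
    """Weight count of a bias-free 3x3 convolution split into `groups` paths."""
    assert in_channels % groups == 0 and out_channels % groups == 0
    # Each path sees in_channels/groups inputs and produces out_channels/groups outputs.
    per_group = (in_channels // groups) * (out_channels // groups) * 3 * 3
    return groups * per_group

dense = conv3x3_weights(64, 64, groups=1)
grouped = conv3x3_weights(64, 64, groups=8)
print(dense, grouped, dense // grouped)  # 36864 4608 8
```

With the channel count held fixed, the weight count drops by a factor equal to the cardinality, which leaves room to widen a grouped block at a similar cost.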

If building a neural network is like organizing a team to solve a complex problem, depth is the number of management layers a decision must pass through, and width is the size of the individual departments. Cardinality is like splitting the problem into many small, identical task forces that work in parallel. Instead of just hiring more people for one giant department (width) or adding more bosses (depth), the paper's results suggest that many parallel paths, each working on a low-dimensional version of the problem, are a more efficient way to reach the correct solution.

Analogy for aggregated transformations: imagine you are a head chef (the neuron) tasked with creating a complex sauce. A simple neuron is like you doing every step yourself: chopping one ingredient, then the next, then stirring them all in a single pot. Aggregated transformation is like hiring a team of 32 specialized sous-chefs (cardinality). You give each chef the same recipe (homogeneous topology), but they each work on a small, distinct portion of the ingredients in their own pans (split and transform). Finally, you pour all their individual results into the main pot (aggregate) to create a sauce that is far more complex and refined than what you could have achieved alone in the same amount of time.

Downsampling is performed by the stride-2 3×3 convolution in the first block of each stage. When the feature map size is halved, the number of channels is doubled to keep the computational cost per layer roughly constant.
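
The claim about per-layer cost can be checked with simple arithmetic. The sketch below counts multiply-accumulates for a dense 3×3 convolution; `conv_macs` is a hypothetical helper, and the sizes are the CIFAR stage sizes used later in this notebook.

```python
def conv_macs(height, width, in_channels, out_channels, kernel=3):
    """Multiply-accumulates of a dense conv that produces a height x width output map."""
    return height * width * in_channels * out_channels * kernel * kernel

stage1 = conv_macs(32, 32, 64, 64)
stage2 = conv_macs(16, 16, 128, 128)  # half the resolution, double the channels
stage3 = conv_macs(8, 8, 256, 256)
print(stage1 == stage2 == stage3)     # True
```

Halving both spatial dimensions divides the number of output positions by 4, while doubling the input and output channels multiplies the cost per position by 4, so the two effects cancel.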

In [1]:
'''
We train the models on the 50k training set and evaluate
on the 10k test set. The input image is 32×32 randomly
cropped from a zero-padded 40×40 image or its flipping,
following [14]. No other data augmentation is used. The
first layer is 3×3 conv with 64 filters. There are 3 stages
each having 3 residual blocks, and the output map size is
32, 16, and 8 for each stage [14]. The network ends with a
global average pooling and a fully-connected layer. Width
is increased by 2× when the stage changes (downsampling),
as in Sec. 3.1.
'''
import torch
from torch import nn
from torchvision import transforms, datasets
from torch.utils.data import DataLoader, Subset

class ResNeXtModule(nn.Module):
    def __init__(self, in_channel, out_channel, reduce_channel, cardinality, stride=1):
        super().__init__()

        assert reduce_channel % cardinality == 0, \
            f"reduce_channel ({reduce_channel}) must be divisible by cardinality ({cardinality})"

        self.layers = nn.Sequential(
            nn.Conv2d(in_channel, reduce_channel, kernel_size=1, bias=False),
            nn.BatchNorm2d(reduce_channel),
            nn.ReLU(inplace=True),

            nn.Conv2d(reduce_channel, reduce_channel, kernel_size=3, stride=stride,
                     padding=1, groups=cardinality, bias=False),
            nn.BatchNorm2d(reduce_channel),
            nn.ReLU(inplace=True),

            nn.Conv2d(reduce_channel, out_channel, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_channel),
        )

        # Shortcut connection for dimension matching
        self.shortcut = nn.Sequential()
        if stride != 1 or in_channel != out_channel:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channel, out_channel, kernel_size=1, stride=stride, bias=False),
                nn.BatchNorm2d(out_channel)
            )

        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = self.shortcut(x)
        out = self.layers(x)
        out += identity
        out = self.relu(out)
        return out

class ResNeXtCIFAR(nn.Module):
    def __init__(self, cardinality=8, width=64, num_classes=100):
        super().__init__()
        
        # First layer: 3×3 conv with 64 filters (no stride, no pooling for CIFAR)
        self.conv1 = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1, bias=False),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True)
        )
        
        # Stage 1: 3 blocks, output map size 32×32, width=64
        self.stage1 = self._make_stage(64, 64, width, cardinality, 3, stride=1)
        
        # Stage 2: 3 blocks, output map size 16×16, width=128 (2× increase)
        self.stage2 = self._make_stage(64, 128, width*2, cardinality, 3, stride=2)
        
        # Stage 3: 3 blocks, output map size 8×8, width=256 (2× increase)
        self.stage3 = self._make_stage(128, 256, width*4, cardinality, 3, stride=2)
        
        # Global average pooling + fully connected
        self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
        self.fc = nn.Linear(256, num_classes)
    
    def _make_stage(self, in_channels, out_channels, reduce_channels, cardinality, num_blocks, stride):
        """Create a stage with multiple ResNeXt blocks"""
        layers = []
        
        # First block handles downsampling and channel increase
        layers.append(ResNeXtModule(in_channels, out_channels, reduce_channels, 
                                   cardinality, stride=stride))
        
        # Remaining blocks keep same dimensions
        for _ in range(num_blocks - 1):
            layers.append(ResNeXtModule(out_channels, out_channels, reduce_channels, 
                                       cardinality, stride=1))
        
        return nn.Sequential(*layers)
    
    def forward(self, x):
        x = self.conv1(x)      # 32×32×64
        x = self.stage1(x)     # 32×32×64
        x = self.stage2(x)     # 16×16×128
        x = self.stage3(x)     # 8×8×256
        x = self.avgpool(x)    # 1×1×256
        x = torch.flatten(x, 1)
        x = self.fc(x)
        return x

In [2]:
torch.backends.cudnn.benchmark = True
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True
torch.set_float32_matmul_precision('high')

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

if torch.cuda.is_available():
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
    torch.cuda.empty_cache()

train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.1),  # Mild to avoid over-distortion
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5071, 0.4867, 0.4408], std=[0.2675, 0.2565, 0.2761]),  # Same statistics as the test transform
    transforms.RandomErasing(p=0.25)  # Apply after normalization for consistency
])

test_transform = transforms.Compose([
    transforms.ToTensor(),  # Must precede Normalize, which expects a tensor
    transforms.Normalize(mean=[0.5071, 0.4867, 0.4408], std=[0.2675, 0.2565, 0.2761])
])

# Load raw datasets
cifar_train_raw = datasets.CIFAR100(root="./data", train=True, download=True, transform=None)

train_size = int(0.9 * len(cifar_train_raw))  # 45,000 of the 50,000 training images

train_indices = list(range(0, train_size))
val_indices = list(range(train_size, len(cifar_train_raw)))

# Create datasets with appropriate transforms
cifar_train = Subset(
    datasets.CIFAR100(root="./data", train=True, transform=train_transform),
    train_indices
)
cifar_val = Subset(
    datasets.CIFAR100(root="./data", train=True, transform=test_transform),
    val_indices
)
# Use the original test set (10,000 samples), held out from training and validation
cifar_test = datasets.CIFAR100(root="./data", train=False, transform=test_transform)

train_loader = DataLoader(
    cifar_train,
    batch_size=1024,
    shuffle=True,
    num_workers=2,
    pin_memory=True,
    persistent_workers=True,
    prefetch_factor=6
)

val_loader = DataLoader(
    cifar_val,  # Use directly
    batch_size=1024,
    shuffle=False,
    num_workers=2,
    pin_memory=True,
    persistent_workers=True,
    prefetch_factor=6
)

test_loader = DataLoader(
    cifar_test,
    batch_size=1024,
    shuffle=False,
    num_workers=2,
    pin_memory=True,
    persistent_workers=True,
    prefetch_factor=6
)

num_classes = 100

model = ResNeXtCIFAR(cardinality=8, width=64, num_classes=100).to(device)

num_epochs = 40
loss_function = nn.CrossEntropyLoss(label_smoothing=0.1)
base_lr = 4e-3

batch_scale = 1024 / 256  # 4x larger batches
scaled_lr = base_lr * batch_scale**0.5  # Square root scaling; kept for reference, not passed to the optimizer

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-3,  # Overridden by OneCycleLR, which starts at max_lr / div_factor
    weight_decay=1e-4
)

scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=1e-2,  # Changed from 3e-3
    epochs=num_epochs,
    steps_per_epoch=len(train_loader),
    pct_start=0.3,
    anneal_strategy='cos',
    div_factor=25.0,
    final_div_factor=1000.0
)

best_val_loss = float('inf')

for epoch in range(num_epochs):
    print(f'Starting Epoch {epoch+1}')
    model.train()

    current_loss = 0.0
    num_batches = 0

    for i, data in enumerate(train_loader):
        inputs, targets = data
        inputs, targets = inputs.to(device), targets.to(device)
            
        outputs = model(inputs)
        loss = loss_function(outputs, targets)
        loss.backward()

        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)  # Changed from 1.0
        
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
        scheduler.step()

        current_loss += loss.item()
        num_batches += 1

        if i % 50 == 0:
            print(f'Batch {i}/{len(train_loader)}, Loss: {loss.item():.4f}')


    avg_train_loss = current_loss / num_batches
    print(f'Epoch {epoch+1} finished')
    print(f'Training - Loss: {avg_train_loss:.4f}')

    if (epoch + 1) % 2 == 0:
        model.eval()
        val_loss = 0.0
        val_batches = 0

        print(f'Epoch {epoch+1} finished')
        print(f'average training loss is {avg_train_loss:.4f}')

        with torch.no_grad():
            for val_data in val_loader:
                val_inputs, val_targets = val_data
                val_inputs, val_targets = val_inputs.to(device), val_targets.to(device)  # Move the batch to the device

                val_outputs = model(val_inputs)
                val_batch_loss = loss_function(val_outputs, val_targets)

                val_loss += val_batch_loss.item()
                val_batches += 1


        avg_val_loss = val_loss / val_batches

        print(f'Epoch {epoch+1} finished')
        print(f'Training - Loss: {avg_train_loss:.4f}')
        print(f'Validation - Loss: {avg_val_loss:.4f}')

if torch.cuda.is_available():
    torch.cuda.empty_cache()


Using device: cuda
GPU Memory: 15.8 GB




Starting Epoch 1
Batch 0/44, Loss: 4.8192
Epoch 1 finished
Training - Loss: 4.4015
Starting Epoch 2
Batch 0/44, Loss: 4.1460
Epoch 2 finished
Training - Loss: 4.0335
Epoch 2 finished
average training loss is 4.0335
Epoch 2 finished
Training - Loss: 4.0335
Validation - Loss: 3.9083
Starting Epoch 3
Batch 0/44, Loss: 3.8987
Epoch 3 finished
Training - Loss: 3.8083
Starting Epoch 4
Batch 0/44, Loss: 3.7351
Epoch 4 finished
Training - Loss: 3.5647
Epoch 4 finished
average training loss is 3.5647
Epoch 4 finished
Training - Loss: 3.5647
Validation - Loss: 3.9766
Starting Epoch 5
Batch 0/44, Loss: 3.4383
Epoch 5 finished
Training - Loss: 3.3191
Starting Epoch 6
Batch 0/44, Loss: 3.1680
Epoch 6 finished
Training - Loss: 3.0940
Epoch 6 finished
average training loss is 3.0940
Epoch 6 finished
Training - Loss: 3.0940
Validation - Loss: 3.7334
Starting Epoch 7
Batch 0/44, Loss: 3.0051
Epoch 7 finished
Training - Loss: 2.9239
Starting Epoch 8
Batch 0/44, Loss: 2.8587
Epoch 8 finished
Training - L
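
The `OneCycleLR` settings in the training cell fix the starting, peak, and final learning rates regardless of the `lr` passed to `AdamW`. The pure-Python sketch below approximates the two-phase cosine schedule as described in the PyTorch documentation (default `three_phase=False`). `one_cycle_lr` is a hypothetical helper, and the 44 steps per epoch are read from the log above. It is meant for inspecting the schedule, not as a replacement for the scheduler.

```python
import math

def one_cycle_lr(step, total_steps, max_lr=1e-2, pct_start=0.3,
                 div_factor=25.0, final_div_factor=1000.0):
    """Approximate learning rate at `step` for a two-phase cosine one-cycle schedule."""
    initial_lr = max_lr / div_factor          # 4e-4 with the training-cell settings
    min_lr = initial_lr / final_div_factor    # 4e-7 with the training-cell settings
    warmup_end = pct_start * total_steps - 1  # last step of the warm-up phase
    last_step = total_steps - 1

    def cos_anneal(start, end, pct):
        # Cosine interpolation from `start` (pct=0) to `end` (pct=1)
        return end + (start - end) / 2.0 * (math.cos(math.pi * pct) + 1.0)

    if step <= warmup_end:
        return cos_anneal(initial_lr, max_lr, step / warmup_end)
    pct = (step - warmup_end) / (last_step - warmup_end)
    return cos_anneal(max_lr, min_lr, pct)

total = 40 * 44  # num_epochs * steps per epoch
for s in (0, 527, total - 1):
    print(s, f"{one_cycle_lr(s, total):.2e}")
```

The learning rate rises during the first 30% of training and then decays for the remainder, so the early epochs in the log are still in the warm-up phase.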

In [3]:
def evaluate_test_set(model):
    model.eval()
    correct = 0
    total = 0
    
    print("Starting evaluation...")
    
    with torch.no_grad():
        for data in test_loader:
            images, labels = data
            images, labels = images.to(device), labels.to(device)
            
            outputs = model(images)

            # The predicted class is the index of the largest logit
            predicted = outputs.argmax(dim=1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()

    accuracy = 100 * correct / total
    print(f'Accuracy of the network on the test images: {accuracy:.2f}%')
    return accuracy

print("\n=== Running standard evaluation ===")
standard_accuracy = evaluate_test_set(model)
# evaluate_test_set already returns a percentage, so it is not scaled by 100 again
print(f'Standard Test Accuracy: {standard_accuracy:.2f}%')


=== Running standard evaluation ===
Starting evaluation...
Accuracy of the network on the test images: 70.25%
Standard Test Accuracy: 70.25%
