In [1]:
import os
import random
from copy import deepcopy

import numpy as np
import torch
import torch.nn as nn
from torch.optim import Adam, AdamW

from sklearn.metrics import roc_auc_score, confusion_matrix, classification_report

from torchvision import transforms, models
from torchvision.datasets import ImageFolder
from torch.utils.data import DataLoader, Subset

import matplotlib.pyplot as plt

from google.colab import drive

drive.mount('/content/drive')



Mounted at /content/drive


In [2]:
# paths and seeds
image_path = '/content/drive/MyDrive/Colab Notebooks/Group30-Project3-main/data'
torch.manual_seed(1)
random.seed(1)
np.random.seed(1)
image_path


'/content/drive/MyDrive/Colab Notebooks/Group30-Project3-main/data'

Our dataset isn’t a clean “same logo, same pose” situation. The Razorback shows up in different colors (red/white), at different orientations (sometimes straight, sometimes a bit diagonal), and with small perspective/scale differences (some images have a tiny hog in the corner, others have it big and centered). There are even cases where a mirrored, left-facing hog appears. If we only train on the exact pixels we have, the model will overfit to those specific looks and fall apart the moment the hog is slightly tilted, smaller, washed out, or shot at an angle.

So the goal with these transforms is to teach invariances the model should have:

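Resize(520) -> geometric jitter -> Resize(500): We first bump images to 520×520, apply the geometric transforms at that size, and finally resize back to 500×500 without cropping. Working at the slightly larger size gives the rotate/translate/shear interpolation a few more pixels to work with before the final downsize. Note that the first resize does not add a border, so a large rotation or shift can still push part of the logo out of frame. What ending with a resize instead of a crop does guarantee is that nothing further is cut away.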

RandomAffine(degrees=30, translate=10%, scale=0.85–1.15, shear=±8°): Covers the real-world variations we actually see: slight tilt/diagonal placements, off-center logos, small vs. large logos, and mild stretch.

RandomPerspective(distortion_scale=0.15, p=0.25): Some photos look a little “stretchy” or shot from an angle. A touch of perspective jitter will encourage our model to key in on shape features instead of memorizing a single flat view.

RandomHorizontalFlip(p=0.5): We do see the hog facing either direction sometimes. So with this we let the classifier learn that mirrored is still the same class.

ColorJitter + RandomGrayscale + RandomAutocontrast: The hog can be red or white, and lighting varies. These help the network ignore unhelpful color/exposure quirks and focus on the actual silhouette/edges. We kept the jitter moderate so we don’t destroy the class signal.

Normalize(ImageNet stats): This generally stabilizes training even for CNNs trained from scratch. It keeps input distributions in a consistent range so the optimizer doesn’t have to fight scale issues.
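
To make the Normalize step concrete, here is a minimal sketch of the per-channel arithmetic it applies once ToTensor has scaled pixels to [0, 1]. The helper name normalize_pixel is hypothetical and exists only for illustration; the mean and std values are the commonly used ImageNet statistics.

```python
# Per-channel normalization: out = (x - mean) / std, with x already in [0, 1].
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def normalize_pixel(rgb):
    """Normalize one RGB pixel given as floats in [0, 1] (hypothetical helper)."""
    return tuple((x - m) / s for x, m, s in zip(rgb, IMAGENET_MEAN, IMAGENET_STD))

white = normalize_pixel((1.0, 1.0, 1.0))   # a pure white pixel
black = normalize_pixel((0.0, 0.0, 0.0))   # a pure black pixel
print([round(v, 3) for v in white])
print([round(v, 3) for v in black])
```

White pixels end up a little above +2 and black pixels a little below -2 in every channel, so both the red and the white hog land in a range centered near zero.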

In [3]:
# transforms: cover color variants (red/white), orientation (tilt/flip), position (translate),
# size (scale), and mild "stretch" (shear/perspective). Always end at 500x500 without cropping.
normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225])

transform_train = transforms.Compose([
    transforms.Resize((520, 520)),
    transforms.RandomAffine(
        degrees=30,                                 # more tilt coverage (diagonal logos)
        translate=(0.10, 0.10),                     # allow off-center logos
        scale=(0.85, 1.15),                         # logo smaller/bigger in frame
        shear=(-8, 8, -8, 8)                        # mild non-uniform skew
    ),
    transforms.RandomPerspective(distortion_scale=0.15, p=0.25),  # mimic stretchy views
    transforms.RandomHorizontalFlip(p=0.5),          # logo can face either direction
    transforms.ColorJitter(brightness=0.18,
                           contrast=0.18,
                           saturation=0.18,
                           hue=0.03),                # red/white/exposure variation
    transforms.RandomGrayscale(p=0.10),
    transforms.RandomAutocontrast(p=0.15),
    transforms.Resize((500, 500)),
    transforms.ToTensor(),
    normalize,
])

transform_eval = transforms.Compose([
    transforms.Resize((500, 500)),
    transforms.ToTensor(),
    normalize,
])


We use a stratified split per class so train/valid/test keep the same with/without-logo balance. We chose the ratios 65/20/15 to give validation a few more samples. This helps when we later pick the decision threshold τ* and reduces the chance we overfit that choice to a tiny valid set.

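Batch size = 32 is a good spot for our GPU/Colab runtime. It is big enough for stable batch statistics, but small enough to keep memory in check with 500×500 images. We shuffle only on train, so the model sees a new order each epoch and the mini-batches are composed differently every time, which keeps the gradient estimates from depending on one fixed grouping of images. Validation and test stay in a fixed order because shuffling has no effect on their metrics.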

count_by_class() is just a sanity check function. It catches subtle mistakes (like an imbalance sneaking in after an indexing bug) and verifies that our train/valid/test distributions are comparable.

In [4]:
# datasets
full_ds_train = ImageFolder(root=image_path, transform=transform_train)
full_ds_eval  = ImageFolder(root=image_path, transform=transform_eval)
class_names = full_ds_eval.classes

# stratified 65/20/15 split
targets = np.array(full_ds_eval.targets)
classes = np.unique(targets)
g = torch.Generator().manual_seed(1)

r_train, r_valid, r_test = 0.65, 0.20, 0.15
train_idx, valid_idx, test_idx = [], [], []
for c in classes:
    idx_c = np.where(targets == c)[0]
    idx_c = torch.tensor(idx_c)[torch.randperm(len(idx_c), generator=g)].tolist()
    n = len(idx_c)
    n_train = int(r_train * n)
    n_valid = int(r_valid * n)
    n_test  = n - n_train - n_valid
    train_idx += idx_c[:n_train]
    valid_idx += idx_c[n_train:n_train+n_valid]
    test_idx  += idx_c[n_train+n_valid:]

train_ds = Subset(full_ds_train, train_idx)
valid_ds = Subset(full_ds_eval,  valid_idx)
test_ds  = Subset(full_ds_eval,  test_idx)

print(f"train: {len(train_ds)} | valid: {len(valid_ds)} | test: {len(test_ds)}")

batch_size = 32
train_dl = DataLoader(train_ds, batch_size=batch_size, shuffle=True,  num_workers=2, pin_memory=True)
valid_dl = DataLoader(valid_ds, batch_size=batch_size, shuffle=False, num_workers=2, pin_memory=True)
test_dl  = DataLoader(test_ds,  batch_size=batch_size, shuffle=False, num_workers=2, pin_memory=True)

def count_by_class(ds):
    base_targets = np.array(ds.dataset.targets)
    split_targets = base_targets[np.array(ds.indices)]
    vals, counts = np.unique(split_targets, return_counts=True)
    return {class_names[v]: int(c) for v, c in zip(vals, counts)}

print("train counts:", count_by_class(train_ds))
print("valid counts:", count_by_class(valid_ds))
print("test  counts:", count_by_class(test_ds))

train: 107 | valid: 33 | test: 26
train counts: {'with_logo': 52, 'without_logo': 55}
valid counts: {'with_logo': 16, 'without_logo': 17}
test  counts: {'with_logo': 13, 'without_logo': 13}
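
The counts above follow from the truncating arithmetic in the split loop. Summing the printed counts gives 81 with_logo and 85 without_logo images, and the sketch below re-derives the per-class split sizes from those totals. The helper name split_counts is hypothetical; it only mirrors the int() calls used in the cell.

```python
import math

def split_counts(n, r_train=0.65, r_valid=0.20):
    """Return (train, valid, test) sizes for one class of n images.

    Train and valid are truncated with int(); test takes whatever is left,
    so no image is dropped by rounding.
    """
    n_train = int(r_train * n)
    n_valid = int(r_valid * n)
    return n_train, n_valid, n - n_train - n_valid

# Class totals obtained by summing the train/valid/test counts printed above.
totals = {"with_logo": 81, "without_logo": 85}
per_class = {name: split_counts(n) for name, n in totals.items()}
for name, sizes in per_class.items():
    print(name, sizes)

# Number of mini-batches per training epoch at batch_size=32.
n_train_total = sum(sizes[0] for sizes in per_class.values())
n_batches = math.ceil(n_train_total / 32)
last_batch = n_train_total - 32 * (n_batches - 1)
print(n_train_total, n_batches, last_batch)
```

This gives (52, 16, 13) and (55, 17, 13), matching the printed counts, and shows that the 107 training images form 4 batches per epoch, the last one holding only 11 images.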


We built a small, fairly deep network. It has enough layers to learn shapes and parts of the Razorback logo, but it is not so big that it memorizes backgrounds. Each block follows the same simple pattern: a convolution to extract features, an optional BatchNorm to steady the activations, a ReLU to add nonlinearity and limit vanishing gradients, and then a max-pool to shrink the feature map. We used pooling because shrinking forces the model to look at bigger and bigger chunks of the picture each time. Early layers catch edges and curves, and later layers can recognize the logo outline regardless of where it sits. We use ReLU over other activations because it’s fast, stable, and works well on small datasets. We also use 3×3 convolutions with padding=1 inside each block so width and height stay the same before pooling, which keeps edge pixels in play and avoids losing content too early.

We also put dropout in the earlier blocks. By dropout we mean randomly turning off a fraction of the units during training so the network cannot rely on any single feature every time, which helps prevent co-adaptation of neurons. This is just us adding a little noise so the model doesn’t overfit to one specific background or lighting condition and instead learns general logo cues.
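
As an illustration of what dropout does to a set of activations, here is a minimal pure-Python sketch of inverted dropout, the variant nn.Dropout uses: each unit is zeroed with probability p during training and the survivors are scaled by 1/(1-p) so the expected activation stays the same. The function name dropout and the toy activations are assumptions made for this example.

```python
import random

def dropout(values, p, rng, training=True):
    """Inverted dropout: zero each value with probability p, scale survivors by 1/(1-p)."""
    if not training or p == 0.0:
        return list(values)          # evaluation mode: pass everything through unchanged
    keep_scale = 1.0 / (1.0 - p)
    return [0.0 if rng.random() < p else v * keep_scale for v in values]

rng = random.Random(0)               # fixed seed so the sketch is repeatable
activations = [1.0] * 10000          # toy activations, all equal to 1
dropped = dropout(activations, p=0.25, rng=rng)

kept_fraction = sum(1 for v in dropped if v != 0.0) / len(dropped)
mean_after = sum(dropped) / len(dropped)
print(f"kept {kept_fraction:.3f} of units, mean activation {mean_after:.3f}")
```

Roughly three quarters of the units survive at p=0.25, and the mean activation stays close to 1 because of the rescaling. In evaluation mode nothing is dropped, which is why model.eval() matters before computing validation metrics.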

After four downsampling stages, we add one more convolution without pooling. That extra conv gives the network a last mixing step across channels while keeping spatial resolution steady. We’re basically just asking the network to combine all the pieces it already learned one last time. This final combination (global average pooling → linear layer) pulls together the logo cues found across the image into a single confidence score, so we don’t overreact to any one pixel or patch.

At the head we use global average pooling instead of flattening a giant feature map. Flattening would create a ton of parameters, which could overfit quickly (especially on a small dataset). Global average pooling compresses each channel down to a single score that roughly means “how much of this feature exists anywhere in the image,” which fits our task of looking for the Razorback logo in any part of a picture. Finally, a single fully connected layer turns the pooled feature strengths into one logit (not a sigmoid yet). We use one layer rather than a stack of dense layers because the extra parameters would likely overfit our small dataset, teaching the model to memorize where the logo happened to appear instead of simply telling us whether it is present.

We train with BCEWithLogitsLoss (a good fit for binary classification), which wraps the sigmoid internally in a numerically stable way. At evaluation time, we pass the logit through a sigmoid to get a probability and then pick a decision threshold on the validation set instead of just using 0.5.

As we go deeper, the feature maps get smaller (because of pooling), but the ideas we want the network to hold onto get more abstract (we go from edges and corners to parts of a hog to the hog logo itself). So we add more channels (32 → 64 → 128 → 256) so the network can store richer features even though each map is smaller. As the spatial size shrinks and the number of channels grows, the network is able to capture complex logo patterns without blowing up the computing cost.
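
As a rough check on the sizes involved in the design above, this sketch traces the spatial side length through four MaxPool2d(2) stages for a 500×500 input and compares the parameter count of the global-average-pooling head with a hypothetical head that flattens the final feature map instead. The helper name pooled_sizes is an assumption for illustration; the padded 3×3 convolutions leave the size unchanged, so only the pools matter here.

```python
def pooled_sizes(size, n_pools, kernel=2):
    """Spatial side length after each MaxPool2d(kernel) stage (floor division)."""
    sizes = [size]
    for _ in range(n_pools):
        sizes.append(sizes[-1] // kernel)
    return sizes

sizes = pooled_sizes(500, n_pools=4)
final_side, channels = sizes[-1], 256

# Head after global average pooling: Linear(256, 1) -> 256 weights + 1 bias.
gap_head_params = channels + 1
# Hypothetical alternative: flatten the 256 x side x side map, then Linear(..., 1).
flatten_head_params = channels * final_side ** 2 + 1

print(sizes)
print(gap_head_params, flatten_head_params)
```

The side length goes 500 → 250 → 125 → 62 → 31, so the pooled head needs 257 parameters while the flattened alternative would need 246,017, which is far more than the 107 training images can constrain.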

In [5]:
def build_cnn(ch=(32,64,128,256), pdrop=0.25, use_bn=True):
    m = nn.Sequential()
    # block 1
    m.add_module('conv1', nn.Conv2d(3, ch[0], 3, padding=1))
    if use_bn: m.add_module('bn1', nn.BatchNorm2d(ch[0]))
    m.add_module('relu1', nn.ReLU())
    m.add_module('pool1', nn.MaxPool2d(2))
    m.add_module('drop1', nn.Dropout(pdrop))
    # block 2
    m.add_module('conv2', nn.Conv2d(ch[0], ch[1], 3, padding=1))
    if use_bn: m.add_module('bn2', nn.BatchNorm2d(ch[1]))
    m.add_module('relu2', nn.ReLU())
    m.add_module('pool2', nn.MaxPool2d(2))
    m.add_module('drop2', nn.Dropout(pdrop))
    # block 3
    m.add_module('conv3', nn.Conv2d(ch[1], ch[2], 3, padding=1))
    if use_bn: m.add_module('bn3', nn.BatchNorm2d(ch[2]))
    m.add_module('relu3', nn.ReLU())
    m.add_module('pool3', nn.MaxPool2d(2))
    # block 4
    m.add_module('conv4', nn.Conv2d(ch[2], ch[3], 3, padding=1))
    if use_bn: m.add_module('bn4', nn.BatchNorm2d(ch[3]))
    m.add_module('relu4', nn.ReLU())
    m.add_module('pool4', nn.MaxPool2d(2))
    # block 5 (no pool)
    m.add_module('conv5', nn.Conv2d(ch[3], ch[3], 3, padding=1))
    if use_bn: m.add_module('bn5', nn.BatchNorm2d(ch[3]))
    m.add_module('relu5', nn.ReLU())
    # head
    m.add_module('gap', nn.AdaptiveAvgPool2d(1))
    m.add_module('flatten', nn.Flatten())
    m.add_module('fc', nn.Linear(ch[3], 1))
    return m


Then we move into training, which turns these layers into a Razorback detector that generalizes to new photos. We use BCEWithLogitsLoss for our binary classification. Because the classes are not perfectly balanced, we pass a pos_weight, which rescales the loss on label-1 examples, so the model does not slide toward the majority label. The optimizer is Adam with a small weight decay to keep parameters from growing without restraint. On top of that we add ReduceLROnPlateau, which watches validation loss and, if progress stalls, automatically lowers the learning rate (here by half) so updates become smaller and steadier.
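
To show what pos_weight changes, here is a small pure-Python sketch of binary cross-entropy on a raw logit, following the formula documented for BCEWithLogitsLoss: the weight multiplies only the term for label-1 targets. The function name bce_with_logits and the example logits are assumptions for illustration, not part of the notebook.

```python
import math

def bce_with_logits(logit, target, pos_weight=1.0):
    """Binary cross-entropy on a raw logit; pos_weight scales the label-1 term."""
    # log(sigmoid(x)) computed in a numerically stable way for either sign of x.
    if logit >= 0:
        log_sig = -math.log1p(math.exp(-logit))
    else:
        log_sig = logit - math.log1p(math.exp(logit))
    # log(1 - sigmoid(x)) = log(sigmoid(x)) - x
    log_one_minus_sig = log_sig - logit
    return -(pos_weight * target * log_sig + (1.0 - target) * log_one_minus_sig)

# An undecided prediction (logit 0) costs log(2) for either label when unweighted.
plain_pos = bce_with_logits(0.0, 1.0)
plain_neg = bce_with_logits(0.0, 0.0)
# With pos_weight=2, mistakes on label-1 examples cost twice as much; label 0 is unchanged.
weighted_pos = bce_with_logits(0.0, 1.0, pos_weight=2.0)
weighted_neg = bce_with_logits(0.0, 0.0, pos_weight=2.0)
print(plain_pos, plain_neg, weighted_pos, weighted_neg)
```

Because only the label-1 term is scaled, it matters which class ImageFolder assigns to label 1 when choosing the ratio.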

To avoid overfitting we keep the best version of the model as training unfolds. Each time the validation loss gets better, we save the weights. When training finishes, we reload that best copy. This ensures the model we evaluate is the one that actually performed best on unseen validation images and not simply the one from the last epoch.

We also separate training from threshold selection. The head outputs a single logit. At evaluation time we pass it through a sigmoid to get a probability. Instead of assuming that 0.50 is the right cutoff, we try a range of cutoffs on the validation set from 0.10 to 0.90. For each cutoff we compute precision, recall, and F1. Precision tells us how many of the images that we predicted had the logo truly had it. Recall tells us how many of the images with logos we caught. And F1 is the harmonic mean of precision and recall. We pick the cutoff that gives the highest F1 because our goal is to balance missed logos and false alarms rather than optimize only one of those numbers. Finally, we keep the test set completely untouched until the very end and run it one time using the chosen threshold.

In [6]:
# training/eval utilities
# train for a few epochs and keep the best weights by validation loss

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

def train_model(model, train_dl, valid_dl, pos_weight, lr=1e-3, wd=1e-4, epochs=25, patience=3):
    model = model.to(device)
    loss_fn = nn.BCEWithLogitsLoss(pos_weight=pos_weight.to(device))
    opt = Adam(model.parameters(), lr=lr, weight_decay=wd)
    sched = torch.optim.lr_scheduler.ReduceLROnPlateau(opt, mode='min', factor=0.5, patience=patience)

    best_val = float('inf'); best_state = None
    hist = {'train_loss':[], 'valid_loss':[], 'train_acc':[], 'valid_acc':[]}

    for ep in range(1, epochs+1):
        model.train()
        tr_loss=0.0; tr_hit=0; tr_n=0
        for xb, yb in train_dl:
            xb, yb = xb.to(device), yb.float().to(device)
            opt.zero_grad()
            logits = model(xb).squeeze(1)
            loss = loss_fn(logits, yb)
            loss.backward(); opt.step()
            tr_loss += loss.item()*xb.size(0)
            preds = (logits>=0).long()
            tr_hit += (preds==yb.long()).sum().item()
            tr_n += xb.size(0)
        tr_loss/=tr_n; tr_acc=tr_hit/tr_n

        model.eval()
        va_loss=0.0; va_hit=0; va_n=0
        with torch.no_grad():
            for xb, yb in valid_dl:
                xb, yb = xb.to(device), yb.float().to(device)
                logits = model(xb).squeeze(1)
                loss = loss_fn(logits, yb)
                va_loss += loss.item()*xb.size(0)
                preds = (logits>=0).long()
                va_hit += (preds==yb.long()).sum().item()
                va_n += xb.size(0)
        va_loss/=va_n; va_acc=va_hit/va_n
        hist['train_loss'].append(tr_loss); hist['valid_loss'].append(va_loss)
        hist['train_acc'].append(tr_acc);  hist['valid_acc'].append(va_acc)

        sched.step(va_loss)
        if va_loss < best_val:
            best_val = va_loss
            best_state = {k:v.cpu().clone() for k,v in model.state_dict().items()}

        # print every 5 epochs
        if ep == 1 or ep % 5 == 0 or ep == epochs:
            print(f'epoch {ep:02d}  train_loss {tr_loss:.4f}  val_loss {va_loss:.4f}  '
                  f'train_acc {tr_acc:.3f}  val_acc {va_acc:.3f}')

    if best_state is not None:
        model.load_state_dict(best_state); model.to(device)
    return model, hist


In [7]:
# validation threshold search
# pick the decision threshold that gives the best f1 on validation and also report auc
def pick_threshold(model, valid_dl, class_names):
    model.eval()
    all_p=[]; all_y=[]
    with torch.no_grad():
        for xb, yb in valid_dl:
            xb = xb.to(device)
            p = torch.sigmoid(model(xb).squeeze(1)).cpu().numpy()
            all_p.append(p); all_y.append(yb.numpy())
    all_p = np.concatenate(all_p); all_y = np.concatenate(all_y)
    auc = roc_auc_score(all_y, all_p)

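    # best = (f1, threshold, confusion matrix, report); label 1 is scored as the positive class
    best=(0.0, 0.5, None, None)
    for t in np.linspace(0.10, 0.90, 41):
        preds=(all_p>=t).astype(int)
        cm = confusion_matrix(all_y, preds)
        # zero_division=0 keeps the report quiet at thresholds where one class is never predicted
        rep = classification_report(all_y, preds, target_names=class_names, digits=3, zero_division=0)
        tp=((preds==1)&(all_y==1)).sum(); fp=((preds==1)&(all_y==0)).sum(); fn=((preds==0)&(all_y==1)).sum()
        # small epsilon avoids division by zero when there are no predicted or true positives
        prec = tp/(tp+fp+1e-9); rec = tp/(tp+fn+1e-9)
        f1 = 2*prec*rec/(prec+rec+1e-9)
        if f1>best[0]:
            best=(f1,t,cm,rep)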
    return {'auc':auc,'best_f1':best[0],'thr':best[1],'cm':best[2],'report':best[3]}


Now we try a controlled comparison across a few settings we believe actually matter. We keep the split locked and recompute the class weight from the train split so every run sees the same class imbalance correction. The settings we vary are channel sizes, dropout, batch norm, learning rate, and weight decay, with the training length fixed at 40 epochs. Bigger channels give the network more capacity to represent higher-level logo cues after pooling has shrunk the image. And because convolution uses the same small filters at every location, increasing the number of channels is how we let the model track more distinct patterns without exploding the parameter count. Dropout (as we mentioned before) is our overfitting brake: it randomly drops features during training so the network learns signals that generalize, and it also prevents co-adaptation of neurons. Batch norm steadies activations and makes optimization less picky about learning rate, which helps on a small dataset. Learning rate controls how big a step each update takes, while weight decay gently pulls weights toward smaller values, which usually helps generalization. We keep epochs modest since we are selecting by validation and we already have a schedule that slows the learning rate when progress stalls.

For each configuration, we build the CNN with those settings, train on the train split while monitoring the validation split, and keep the best weights by validation loss. After training a candidate, we select a decision threshold on the validation set and record F1, AUC, the confusion matrix, and the chosen threshold. We also save the training history and a copy of the model parameters so we can reload the exact winning candidate later without retraining.

As we mentioned we choose the winner by validation F1 because we want a balance between catching true logos and avoiding false alarms. AUC is reported alongside it to show how well the model ranks positives above negatives across all possible thresholds.
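
The ranking interpretation of AUC can be made concrete with a small sketch: AUC equals the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as half. The function name pairwise_auc and the four toy scores are assumptions for illustration; the notebook itself uses roc_auc_score.

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: one positive (0.35) is ranked below one negative (0.40).
labels = [0, 0, 1, 1]
scores = [0.10, 0.40, 0.35, 0.80]
auc = pairwise_auc(labels, scores)
print(auc)
```

Three of the four positive/negative pairs are ordered correctly here, so the AUC is 0.75 no matter which threshold is later chosen. That threshold independence is why AUC is reported next to F1 rather than instead of it.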

In [14]:
# loops over a grid and tracks the best validation f1
# includes channel sizes, dropout, batchnorm on/off, learning rate, weight decay.

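counts = count_by_class(train_ds)
# NOTE: ImageFolder sorts class folders by name, so with_logo is label 0 and without_logo is label 1.
# BCEWithLogitsLoss applies pos_weight to label-1 targets, so this ratio scales the without_logo term;
# the runs recorded below used it as written.
n_pos = counts.get('with_logo', 1)
n_neg = counts.get('without_logo', 1)
pos_weight = torch.tensor([n_neg/max(1,n_pos)])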

search_space = [
    # (channels, dropout, use_bn, lr, wd, epochs)
    ((32,64,128,256), 0.25, True,  1e-3, 1e-4, 40),
    ((32,64,128,256), 0.40, True,  1e-3, 1e-4, 40),
    ((32,64,96,128),  0.25, True,  1e-3, 5e-5, 40),
    ((32,64,128,256), 0.25, False, 1e-3, 1e-4, 40),
    ((32,64,128,256), 0.25, True,  5e-4, 1e-4, 40),
]

results = []
best_pack = None

for i,(chs,pdrop,use_bn,lr,wd,epochs) in enumerate(search_space, start=1):
    print(f'\n=== experiment {i}: ch={chs}, drop={pdrop}, bn={use_bn}, lr={lr}, wd={wd}, epochs={epochs} ===')
    model_i = build_cnn(ch=chs, pdrop=pdrop, use_bn=use_bn)
    model_i, hist_i = train_model(model_i, train_dl, valid_dl, pos_weight, lr=lr, wd=wd, epochs=epochs, patience=5)
    val_pick = pick_threshold(model_i, valid_dl, class_names)
    rec = {
        'index': i,
        'chs': chs, 'drop': pdrop, 'bn': use_bn, 'lr': lr, 'wd': wd, 'epochs': epochs,
        'val_auc': float(val_pick['auc']),
        'val_best_f1': float(val_pick['best_f1']),
        'thr': float(val_pick['thr']),
        'cm': val_pick['cm'],
        'report': val_pick['report'],
        'hist': hist_i,
        'state_dict': {k:v.cpu() for k,v in model_i.state_dict().items()}
    }
    results.append(rec)
    if (best_pack is None) or (rec['val_best_f1'] > best_pack['val_best_f1']):
        best_pack = rec

print('\nchosen by best validation f1:')
print(f"exp {best_pack['index']}  f1={best_pack['val_best_f1']:.3f}  auc={best_pack['val_auc']:.3f}  thr={best_pack['thr']:.2f}")
print('confusion:\n', best_pack['cm'])
print(best_pack['report'])



=== experiment 1: ch=(32, 64, 128, 256), drop=0.25, bn=True, lr=0.001, wd=0.0001, epochs=40 ===
epoch 01  train_loss 0.8325  val_loss 0.7281  train_acc 0.402  val_acc 0.394
epoch 05  train_loss 0.6896  val_loss 0.8612  train_acc 0.589  val_acc 0.455
epoch 10  train_loss 0.6676  val_loss 0.7062  train_acc 0.589  val_acc 0.515
epoch 15  train_loss 0.6239  val_loss 0.7346  train_acc 0.645  val_acc 0.545
epoch 20  train_loss 0.5869  val_loss 0.6535  train_acc 0.738  val_acc 0.697
epoch 25  train_loss 0.5824  val_loss 0.6498  train_acc 0.692  val_acc 0.636
epoch 30  train_loss 0.5854  val_loss 0.6418  train_acc 0.673  val_acc 0.636
epoch 35  train_loss 0.5640  val_loss 0.6669  train_acc 0.729  val_acc 0.636
epoch 40  train_loss 0.5079  val_loss 0.6438  train_acc 0.804  val_acc 0.576

=== experiment 2: ch=(32, 64, 128, 256), drop=0.4, bn=True, lr=0.001, wd=0.0001, epochs=40 ===
epoch 01  train_loss 0.7622  val_loss 0.6280  train_acc 0.570  val_acc 0.697
epoch 05  train_loss 0.7104  val_loss


=== experiment 3: ch=(32, 64, 96, 128), drop=0.25, bn=True, lr=0.001, wd=5e-05, epochs=40 ===


epoch 01  train_loss 0.7992  val_loss 0.7514  train_acc 0.495  val_acc 0.485
epoch 05  train_loss 0.6792  val_loss 0.7264  train_acc 0.626  val_acc 0.515
epoch 10  train_loss 0.6546  val_loss 0.6989  train_acc 0.617  val_acc 0.545
epoch 15  train_loss 0.6527  val_loss 0.7883  train_acc 0.617  val_acc 0.545
epoch 20  train_loss 0.6114  val_loss 0.7429  train_acc 0.673  val_acc 0.606
epoch 25  train_loss 0.5605  val_loss 0.7594  train_acc 0.729  val_acc 0.576
epoch 30  train_loss 0.5942  val_loss 0.7451  train_acc 0.682  val_acc 0.606
epoch 35  train_loss 0.5499  val_loss 0.7461  train_acc 0.748  val_acc 0.636
epoch 40  train_loss 0.5920  val_loss 0.7279  train_acc 0.682  val_acc 0.636



=== experiment 4: ch=(32, 64, 128, 256), drop=0.25, bn=False, lr=0.001, wd=0.0001, epochs=40 ===
epoch 01  train_loss 0.8065  val_loss 0.7119  train_acc 0.514  val_acc 0.515
epoch 05  train_loss 0.7145  val_loss 0.7143  train_acc 0.486  val_acc 0.485
epoch 10  train_loss 0.7110  val_loss 0.7135  train_acc 0.514  val_acc 0.515
epoch 15  train_loss 0.7089  val_loss 0.7151  train_acc 0.514  val_acc 0.515
epoch 20  train_loss 0.7084  val_loss 0.7171  train_acc 0.514  val_acc 0.515
epoch 25  train_loss 0.7053  val_loss 0.7184  train_acc 0.533  val_acc 0.515
epoch 30  train_loss 0.7042  val_loss 0.7209  train_acc 0.542  val_acc 0.515
epoch 35  train_loss 0.6984  val_loss 0.7235  train_acc 0.542  val_acc 0.515
epoch 40  train_loss 0.6971  val_loss 0.7255  train_acc 0.523  val_acc 0.515



=== experiment 5: ch=(32, 64, 128, 256), drop=0.25, bn=True, lr=0.0005, wd=0.0001, epochs=40 ===


epoch 01  train_loss 0.7830  val_loss 0.7200  train_acc 0.514  val_acc 0.485
epoch 05  train_loss 0.6802  val_loss 0.7256  train_acc 0.570  val_acc 0.485
epoch 10  train_loss 0.6501  val_loss 0.6662  train_acc 0.598  val_acc 0.545
epoch 15  train_loss 0.5953  val_loss 0.6782  train_acc 0.701  val_acc 0.606
epoch 20  train_loss 0.6012  val_loss 0.6744  train_acc 0.664  val_acc 0.545
epoch 25  train_loss 0.5449  val_loss 0.6409  train_acc 0.710  val_acc 0.667
epoch 30  train_loss 0.5323  val_loss 0.6747  train_acc 0.738  val_acc 0.576
epoch 35  train_loss 0.5122  val_loss 0.6859  train_acc 0.757  val_acc 0.606
epoch 40  train_loss 0.5171  val_loss 0.7550  train_acc 0.748  val_acc 0.576




chosen by best validation f1:
exp 5  f1=0.778  auc=0.754  thr=0.48
confusion:
 [[11  5]
 [ 3 14]]
              precision    recall  f1-score   support

   with_logo      0.786     0.688     0.733        16
without_logo      0.737     0.824     0.778        17

    accuracy                          0.758        33
   macro avg      0.761     0.756     0.756        33
weighted avg      0.761     0.758     0.756        33




Candidate 5 was our winner. With a threshold of 0.48 it lands at F1 = 0.778 and AUC = 0.754, so the model separates the two classes reasonably well but there is room for improvement. One caveat on the F1: pick_threshold scores label 1 as the positive class, and ImageFolder’s alphabetical ordering makes label 1 without_logo, so 0.778 is the without_logo F1 (the with_logo F1 in the report is 0.733). The confusion matrix tells us that out of 16 true “with_logo” images, we correctly caught 11 and missed 5. Out of 17 “without_logo” images, we correctly rejected 14 and mis-flagged 3. Overall accuracy sits at 0.758 and our performance is fairly even across classes.
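
The per-class numbers in the report can be re-derived directly from the confusion matrix, which is a useful check on which class each metric belongs to. This is a minimal sketch; the helper name prf is hypothetical, and the matrix is copied from the output above (rows are true labels, columns are predictions, with label 0 = with_logo and label 1 = without_logo because ImageFolder sorts class folders by name).

```python
# Validation confusion matrix for the chosen experiment: rows = true, cols = predicted.
cm = [[11, 5],
      [3, 14]]

def prf(cm, positive):
    """Precision, recall, and F1 for one class of a 2x2 confusion matrix."""
    other = 1 - positive
    tp = cm[positive][positive]   # true class and predicted class both equal `positive`
    fp = cm[other][positive]      # other class predicted as `positive`
    fn = cm[positive][other]      # `positive` class predicted as the other class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall, 2 * precision * recall / (precision + recall)

for label, name in enumerate(["with_logo", "without_logo"]):
    p, r, f1 = prf(cm, label)
    print(f"{name:>12}  precision {p:.3f}  recall {r:.3f}  f1 {f1:.3f}")

accuracy = (cm[0][0] + cm[1][1]) / sum(map(sum, cm))
print(f"accuracy {accuracy:.3f}")
```

The without_logo row comes out at F1 = 0.778 and the with_logo row at 0.733, matching the report, and accuracy is 25/33 = 0.758.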

At this point we’ve probably done what we can for this custom CNN. We tuned channels, dropout, batch norm, learning rate, weight decay, and the decision threshold. We used careful augmentation and locked splits.

Given the capacity of this architecture and the size of our dataset, more epochs or tiny hyperparameter nudges will move errors around rather than have a big accuracy change. Deeper pretrained backbones would bring wider receptive fields and a larger bank of filters that capture textures and part configurations learned from millions of images. So our next step is to bring in stronger features through transfer learning. A ResNet pretrained on ImageNet already encodes edges, corners, textures, and multi-scale patterns, and its residual connections make fine-tuning stable on small datasets. Fine-tuning that backbone on our Razorback task should give us better invariance to color, orientation, and scale with the same amount of data.

We use early stopping to guard against overfitting. We watch validation loss because it reflects how well the model handles new images. If that loss stops getting better for several epochs in a row, early stopping halts training and keeps the best version we saw along the way. This prevents the network from memorizing background quirks once it has already learned the useful logo patterns. The patience value gives the model a few chances to improve before we stop, and saving the best weights ensures the model we evaluate is the one that actually generalized best during training.

In [8]:
# for early stopping
class EarlyStopper:
    def __init__(self, patience=8, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.count = 0
        self.best_state = None
        self.best_epoch = 0

    def step(self, val_loss, model, epoch):
        improved = (self.best - val_loss) > self.min_delta
        if improved:
            self.best = val_loss
            self.count = 0
            # keep a copy of best weights
            self.best_state = {k: v.cpu().clone() for k, v in model.state_dict().items()}
            self.best_epoch = epoch
        else:
            self.count += 1
        return self.count >= self.patience
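
To see how patience and min_delta interact, here is a small trace of the same counting rule on a made-up list of validation losses. The function name simulate_early_stopping and the loss values are ours, for illustration only; it mirrors the comparison in EarlyStopper.step but skips the weight snapshot.

```python
def simulate_early_stopping(val_losses, patience=8, min_delta=0.0):
    """Replay the EarlyStopper counting rule on a list of validation losses.

    Returns (stop_epoch, best_epoch); stop_epoch is None if training
    would have run through every epoch without triggering a stop.
    """
    best = float("inf")
    count = 0
    best_epoch = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if (best - loss) > min_delta:   # improvement must beat min_delta
            best, best_epoch, count = loss, epoch, 0
        else:
            count += 1                  # one more epoch without real progress
        if count >= patience:
            return epoch, best_epoch
    return None, best_epoch


# Hypothetical losses: epoch 5 improves on the best by only 0.0005.
losses = [0.70, 0.60, 0.55, 0.56, 0.5495, 0.57, 0.58]
stop_epoch, best_epoch = simulate_early_stopping(losses, patience=3, min_delta=1e-3)
print(stop_epoch, best_epoch)
```

With patience=3 and min_delta=1e-3, the tiny gain at epoch 5 does not reset the counter, so the run stops at epoch 6 and the best snapshot is the one from epoch 3.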


Now we switch to a ResNet-based setup to get the most out of transfer learning. The build_resnet18_binary function loads a ResNet-18 backbone that was trained on ImageNet when pretrained=True. That gives us a network that already knows a lot of generic visual features like edges, corners, textures, and part shapes, which is exactly what we want when our own dataset is small. We replace only the final classifier so it outputs a single logit for our binary label.

From there we control which layers learn and when. “Freezing” a layer means we stop updating its weights by setting requires_grad=False. During the first phase we freeze the entire backbone and train only the new fully connected head, which is a single linear layer that reads ResNet’s final feature vector and turns it into one decision score. After that we “unfreeze” just the last ResNet block and keep earlier layers frozen. The early layers capture very general patterns that we want to preserve, while the later layers are more task specific and benefit from a little fine-tuning at a smaller learning rate. This two-phase plan lets us adapt the model to Razorback cues without overfitting the small dataset.

In [9]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

def build_resnet18_binary(pretrained=True):
    if pretrained:
        try:
            weights = models.ResNet18_Weights.IMAGENET1K_V1
            net = models.resnet18(weights=weights)
        except AttributeError:
            # older torchvision releases without the Weights enum
            net = models.resnet18(pretrained=True)
    else:
        net = models.resnet18(weights=None)
    # replace the classifier for binary logits
    in_feats = net.fc.in_features
    net.fc = nn.Linear(in_feats, 1)
    return net

# freeze/unfreeze
def set_requires_grad(module, requires_grad: bool):
    for p in module.parameters():
        p.requires_grad = requires_grad
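
A quick way to confirm that freezing did what we expect is to count the trainable parameters before and after. The sketch below uses a tiny stand-in network instead of ResNet-18 so it runs without downloading weights; the toy model and the count_trainable helper are our own illustration, not part of the pipeline.

```python
import torch.nn as nn


def set_requires_grad(module, requires_grad: bool):
    # same helper as above, repeated so this snippet runs on its own
    for p in module.parameters():
        p.requires_grad = requires_grad


def count_trainable(module):
    # number of scalar weights the optimizer would actually update
    return sum(p.numel() for p in module.parameters() if p.requires_grad)


# Toy stand-in: toy[0] plays the "backbone", toy[2] plays the new head.
toy = nn.Sequential(nn.Linear(10, 6), nn.ReLU(), nn.Linear(6, 1))

all_trainable = count_trainable(toy)   # everything trainable by default

set_requires_grad(toy, False)          # freeze the whole network
set_requires_grad(toy[2], True)        # then unfreeze only the head
head_only = count_trainable(toy)

print(all_trainable, head_only)
```

The same check works on the real model: ResNet-18's fc layer has 512 input features, so after the phase-one freeze only the 513 parameters of the new head (512 weights plus one bias) should remain trainable.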

Now we actually fine-tune ResNet in two phases. First we load the pretrained ResNet-18 described above, with its final classifier replaced by a single-logit layer for our binary task, then we freeze the whole backbone and train only that new head. As we mentioned, the early and middle layers already encode generic edges, textures, and part shapes from ImageNet, so we leave them alone and just teach the last layer how to map those features to “with_logo” vs “without_logo.” Our early stopper is also in play here, watching validation loss.

After that we move to phase two. We keep the early blocks frozen and unfreeze only layer4 plus the head. layer4 is closest to the classifier and holds the most task-specific features, so letting it learn at a smaller learning rate lets ResNet adapt its higher-level patterns to Razorback cues while preserving the general features learned earlier. That is why we pass two parameter groups to AdamW: one with a conservative lr_backbone for layer4, and one with a larger rate for the head. We create a fresh validation-driven scheduler and early stopper for this phase, configured the same way as in phase one. Throughout both phases we track train and validation loss and accuracy, and at the end of each phase we restore the best validation weights before returning the model and the training history.

In [10]:
def train_resnet18_two_phase(train_dl, valid_dl, class_names,
                             pos_weight,  # tensor([neg/pos])
                             epochs_head=20, epochs_unfreeze=40,
                             lr_head=1e-3, lr_backbone=1e-4, wd=1e-4,
                             patience=8, min_delta=1e-3, print_every=5):
    model = build_resnet18_binary(pretrained=True).to(device)
    loss_fn = nn.BCEWithLogitsLoss(pos_weight=pos_weight.to(device))

    # 1: train head only
    set_requires_grad(model, False)
    set_requires_grad(model.fc, True)

    opt = AdamW(model.fc.parameters(), lr=lr_head, weight_decay=wd)
    sched = torch.optim.lr_scheduler.ReduceLROnPlateau(opt, mode='min', factor=0.5, patience=3)
    stopper = EarlyStopper(patience=patience, min_delta=min_delta)

    def run_epochs(n_epochs, desc):
        hist = {'train_loss':[], 'valid_loss':[], 'train_acc':[], 'valid_acc':[], 'lrs':[]}
        for ep in range(1, n_epochs+1):
            # train
            model.train()
            tr_loss=0.0; tr_hit=0; tr_n=0
            for xb, yb in train_dl:
                xb, yb = xb.to(device), yb.float().to(device)
                opt.zero_grad()
                logits = model(xb).squeeze(1)
                loss = loss_fn(logits, yb)
                loss.backward(); opt.step()
                tr_loss += loss.item() * xb.size(0)
                tr_hit  += ((logits>=0).long() == yb.long()).sum().item()
                tr_n    += xb.size(0)
            tr_loss/=tr_n; tr_acc=tr_hit/tr_n

            # valid
            model.eval()
            va_loss=0.0; va_hit=0; va_n=0
            with torch.no_grad():
                for xb, yb in valid_dl:
                    xb, yb = xb.to(device), yb.float().to(device)
                    logits = model(xb).squeeze(1)
                    loss = loss_fn(logits, yb)
                    va_loss += loss.item()*xb.size(0)
                    va_hit  += ((logits>=0).long() == yb.long()).sum().item()
                    va_n    += xb.size(0)
            va_loss/=va_n; va_acc=va_hit/va_n

            hist['train_loss'].append(tr_loss); hist['valid_loss'].append(va_loss)
            hist['train_acc'].append(tr_acc);  hist['valid_acc'].append(va_acc)
            hist['lrs'].append(opt.param_groups[0]['lr'])

            sched.step(va_loss)
            if (ep==1) or (ep%print_every==0) or (ep==n_epochs):
                print(f'{desc} epoch {ep:03d}  train_loss {tr_loss:.4f}  val_loss {va_loss:.4f}  '
                      f'train_acc {tr_acc:.3f}  val_acc {va_acc:.3f}  lr {opt.param_groups[0]["lr"]:.2e}')

            if stopper.step(va_loss, model, ep):
                print(f'{desc} Early stopping at epoch {ep} (best {stopper.best_epoch} val_loss {stopper.best:.4f})')
                break

        # restore best
        if stopper.best_state is not None:
            model.load_state_dict(stopper.best_state)
            print(f'{desc} loaded best weights from epoch {stopper.best_epoch}')
        return hist

    print('--- Phase 1: head-only fine-tune ---')
    hist1 = run_epochs(epochs_head, desc='[Head]')

    # 2: unfreeze last block (layer4) + head
    print('--- Phase 2: unfreeze layer4 + head ---')
    set_requires_grad(model, False)
    set_requires_grad(model.layer4, True)
    set_requires_grad(model.fc, True)

    # different LRs for backbone vs head
    params = [
        {'params': model.layer4.parameters(), 'lr': lr_backbone},
        {'params': model.fc.parameters(),      'lr': max(lr_backbone*5, lr_head/2)},
    ]
    opt = AdamW(params, weight_decay=wd)
    sched = torch.optim.lr_scheduler.ReduceLROnPlateau(opt, mode='min', factor=0.5, patience=3)
    stopper = EarlyStopper(patience=patience, min_delta=min_delta)  # reset stopper

    hist2 = run_epochs(epochs_unfreeze, desc='[Unfreeze]')

    return model, {'phase1': hist1, 'phase2': hist2}
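
The per-group learning rates in phase two are easy to inspect on their own. Here is a minimal sketch using two small linear layers as stand-ins for model.layer4 and model.fc (the stand-in layers are hypothetical; the learning-rate values are the function defaults above).

```python
import torch.nn as nn
from torch.optim import AdamW

backbone_block = nn.Linear(8, 4)   # stand-in for model.layer4
head = nn.Linear(4, 1)             # stand-in for model.fc

lr_head, lr_backbone, wd = 1e-3, 1e-4, 1e-4

# Each dict is one parameter group with its own learning rate;
# weight_decay is shared because it is passed at the top level.
opt = AdamW(
    [
        {"params": backbone_block.parameters(), "lr": lr_backbone},
        {"params": head.parameters(), "lr": max(lr_backbone * 5, lr_head / 2)},
    ],
    weight_decay=wd,
)

group_lrs = [g["lr"] for g in opt.param_groups]
print(group_lrs)
```

With the defaults, the head group ends up at 5e-4, five times the layer4 rate. Note that the training loop logs opt.param_groups[0]['lr'], so during phase two the printed lr is the layer4 rate, not the head rate.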


Here we run the full “locked” evaluation pass. First we recompute pos_weight from the train split only. Then we fine-tune ResNet-18 with the two-phase plan. When training finishes, we keep the best validation snapshot and move to threshold picking. We use the validation loader to convert logits to probabilities and try a range of cutoffs to find the one that gives the best F1 for this dataset.
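
To make the role of pos_weight concrete, here is a small worked example of the weighted binary cross-entropy for a single example, written in plain Python. The class counts are hypothetical, and the function is our own re-derivation of the per-example formula that BCEWithLogitsLoss applies when pos_weight is set, not the library code.

```python
import math


def weighted_bce_with_logits(logit, target, pos_weight):
    """Per-example loss: -[pos_weight * y * log(sigmoid(x)) + (1 - y) * log(1 - sigmoid(x))]."""
    log_sig = -math.log1p(math.exp(-logit))            # log(sigmoid(x))
    log_one_minus_sig = -math.log1p(math.exp(logit))   # log(1 - sigmoid(x))
    return -(pos_weight * target * log_sig + (1 - target) * log_one_minus_sig)


# Hypothetical train-split counts, only to show the arithmetic.
n_pos, n_neg = 40, 60
pos_weight = n_neg / max(1, n_pos)

# An undecided model (logit 0, probability 0.5) on one positive and one negative example.
loss_on_positive = weighted_bce_with_logits(0.0, 1.0, pos_weight)
loss_on_negative = weighted_bce_with_logits(0.0, 0.0, pos_weight)
print(pos_weight, loss_on_positive, loss_on_negative)
```

Only the positive-target term is scaled, so with these counts a mistake on a label-1 example costs 1.5 times as much as the same mistake on a label-0 example. Computing the ratio from the train split alone keeps validation and test statistics out of the loss.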

Only after the threshold is set do we touch the test set. We run the model once across test, collect probabilities, apply the chosen cutoff, and report AUC, the confusion matrix, and the full classification report.
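
The threshold-picking step described above can be sketched as a simple sweep over candidate cutoffs. Our pick_threshold helper is defined in an earlier cell and may differ in its grid and tie-breaking; the function names, the grid, and the toy labels and probabilities below are assumptions for illustration.

```python
def f1_at_threshold(y_true, probs, thr):
    """F1 for label 1 when we predict 1 whenever prob >= thr."""
    tp = fp = fn = 0
    for y, p in zip(y_true, probs):
        pred = 1 if p >= thr else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 1:
            fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def sweep_thresholds(y_true, probs, grid):
    """Return (threshold, F1) for the first grid point with the highest F1."""
    best_thr, best_f1 = None, -1.0
    for thr in grid:
        f1 = f1_at_threshold(y_true, probs, thr)
        if f1 > best_f1:          # strict '>' keeps the lowest tied threshold
            best_thr, best_f1 = thr, f1
    return best_thr, best_f1


# Toy validation labels and predicted probabilities (made up).
y_val = [1, 1, 1, 0, 0, 0]
p_val = [0.90, 0.70, 0.45, 0.60, 0.30, 0.10]
grid = [i / 100 for i in range(5, 100, 5)]   # 0.05, 0.10, ..., 0.95

best_thr, best_f1 = sweep_thresholds(y_val, p_val, grid)
print(best_thr, round(best_f1, 3))
```

The important part is the order of operations: the sweep only ever sees validation predictions, and the chosen cutoff is then applied unchanged to the test set.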

In [11]:
# here we train with locked protocol
counts = count_by_class(train_ds)
n_pos  = counts.get('with_logo', 1)
n_neg  = counts.get('without_logo', 1)
pos_weight = torch.tensor([n_neg/max(1, n_pos)])

model_resnet, hist = train_resnet18_two_phase(
    train_dl, valid_dl, class_names, pos_weight,
    epochs_head=20,
    epochs_unfreeze=50,
    lr_head=1e-3,
    lr_backbone=1e-4,
    wd=1e-4,
    patience=10, min_delta=1e-3, print_every=5
)

# threshold on valid, then one-shot test
val_pick = pick_threshold(model_resnet, valid_dl, class_names)
tau_star = float(val_pick['thr'])
print("\n=== validation pick (frozen) ===")
print(f"AUC={float(val_pick['auc']):.3f}  best F1={float(val_pick['best_f1']):.3f}  tau*={tau_star:.2f}")
print("confusion:\n", val_pick['cm'])
print(val_pick['report'])

model_resnet.eval()
all_y, all_p = [], []
with torch.no_grad():
    for xb, yb in test_dl:
        p = torch.sigmoid(model_resnet(xb.to(device)).squeeze(1)).cpu().numpy()
        all_p.append(p); all_y.append(yb.numpy())
all_p = np.concatenate(all_p); all_y = np.concatenate(all_y)
preds = (all_p >= tau_star).astype(int)

print("\n=== final test (single shot) ===")
print("test AUC:", roc_auc_score(all_y, all_p))
print("test confusion:\n", confusion_matrix(all_y, preds))
print(classification_report(all_y, preds, target_names=class_names, digits=3))
print(f"(decision threshold used: tau* = {tau_star:.2f})")


Downloading: "https://download.pytorch.org/models/resnet18-f37072fd.pth" to /root/.cache/torch/hub/checkpoints/resnet18-f37072fd.pth


100%|██████████| 44.7M/44.7M [00:00<00:00, 195MB/s]


--- Phase 1: head-only fine-tune ---
[Head] epoch 001  train_loss 0.7676  val_loss 0.7085  train_acc 0.439  val_acc 0.515  lr 1.00e-03
[Head] epoch 005  train_loss 0.6899  val_loss 0.6439  train_acc 0.607  val_acc 0.667  lr 1.00e-03
[Head] epoch 010  train_loss 0.6269  val_loss 0.6157  train_acc 0.710  val_acc 0.697  lr 1.00e-03
[Head] epoch 015  train_loss 0.5775  val_loss 0.6046  train_acc 0.813  val_acc 0.727  lr 1.00e-03
[Head] epoch 020  train_loss 0.5370  val_loss 0.5900  train_acc 0.850  val_acc 0.727  lr 1.00e-03
[Head] loaded best weights from epoch 20
--- Phase 2: unfreeze layer4 + head ---
[Unfreeze] epoch 001  train_loss 0.5280  val_loss 0.4836  train_acc 0.813  val_acc 0.667  lr 1.00e-04
[Unfreeze] epoch 005  train_loss 0.1132  val_loss 0.1965  train_acc 0.972  val_acc 0.939  lr 1.00e-04
[Unfreeze] epoch 010  train_loss 0.0276  val_loss 0.1632  train_acc 1.000  val_acc 0.939  lr 1.00e-04
[Unfreeze] epoch 015  train_loss 0.0190  val_loss 0.2392  train_acc 1.000  val_acc 0.9

Phase 2 clearly won. Once we unfroze layer4, validation loss dropped fast and the early stopper settled on epoch 9 as the best snapshot. On that locked validation split we hit AUC = 1.00 and F1 = 1.00, which usually means the split is small and the model fit it almost perfectly. That is why the real check is the held-out test set. The test set was never used for training, tuning, or threshold picking, so its numbers reflect how the model behaves on truly unseen images. There we land around AUC ≈ 0.87 and accuracy ≈ 0.77 with τ* = 0.66, which is a solid bump over the custom CNN and a more honest picture of generalization.

Compared to our custom CNN’s validation results (AUC 0.757, F1 0.778, accuracy 0.758 at τ = 0.48), the ResNet shows a stronger ranking signal overall and better recall on the with_logo class on truly unseen data. With ResNet, with_logo recall rises from about 0.69 to about 0.85, which means we miss far fewer actual logos. Overall accuracy is in the same range, but the higher AUC and the jump in with_logo recall are the big wins for our goal of reliably finding razorbacks.

In [12]:
final_model = model_resnet
final_model.eval()
final_model.to('cpu')

save_dir = "/content/drive/MyDrive/Colab Notebooks/Group30-Project3-main"
os.makedirs(save_dir, exist_ok=True)

save_path = os.path.join(save_dir, "Group_30_CNN_FullModel.ph")
torch.save(final_model, save_path)

print(f"Saved full model to: {save_path}")

Saved full model to: /content/drive/MyDrive/Colab Notebooks/Group30-Project3-main/Group_30_CNN_FullModel.ph
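
Saving the whole model object with torch.save pickles the class along with the weights, so loading it later needs the same code available (and, as far as we know, weights_only=False in torch.load on recent PyTorch versions). A common alternative is to save only the state_dict and rebuild the architecture before loading. Here is a minimal sketch of that round trip on a tiny stand-in layer rather than our actual ResNet; the file name is hypothetical.

```python
import os
import tempfile

import torch
import torch.nn as nn

model = nn.Linear(4, 1)   # stand-in for the trained network

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "weights_only_example.pth")   # hypothetical file name
    torch.save(model.state_dict(), path)                   # tensors only, no pickled class

    restored = nn.Linear(4, 1)                             # rebuild the same architecture first
    restored.load_state_dict(torch.load(path, map_location="cpu"))

weights_match = all(
    torch.equal(a, b)
    for a, b in zip(model.state_dict().values(), restored.state_dict().values())
)
print(weights_match)
```

For our model, the rebuild step would be build_resnet18_binary(pretrained=False) followed by load_state_dict.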
