HHU Deep Representation Learning, Prof. Dr. Markus Kollmann

Lectures and tutoring by Nikolas Adaloglou and Felix Michels.

# Assignment 07 - CLIP for unsupervised OOD detection (2-week exercise)


---

Submit the solved notebook (not a zip) with your full name plus the assignment number as the filename, e.g. `max_mustermann_a1.ipynb` for assignment 1.

This is a **two-week exercise**. If we feel that you have genuinely tried to solve the exercise, you will receive **2** points for this assignment, regardless of the quality of your solution.

## <center> DUE FRIDAY 27.06.2025 </center>

Drop-off link: [https://uni-duesseldorf.sciebo.de/s/QUtbyMFxxKPtCWO](https://uni-duesseldorf.sciebo.de/s/QUtbyMFxxKPtCWO)

---



## Contents

1. Basic imports
2. Get the visual features of the CLIP model
3. Compute the k-NN similarity as the OOD score
4. Compute MSP using the text encoder and the label names
5. Linear probing on the pseudolabels
6. Mahalanobis distance as OOD score
7. Mahalanobis distance using the real labels without linear probing
8. K-means clusters combined with Mahalanobis distance

---

## Overview
We will apply the representations learned by Contrastive Language-Image Pretraining (CLIP) to the downstream task of out-of-distribution (OOD) detection.

We will be using the model 'convnext_base_w' pretrained on 'laion2b_s13b_b82k' throughout this tutorial.

Information and examples on how to use CLIP models for inference are provided in the [open_clip](https://github.com/mlfoundations/open_clip#usage) repository.

- [Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/abs/2103.00020)
- [Contrastive Language-Image Pretrained (CLIP) Models are Powerful Out-of-Distribution Detectors](https://arxiv.org/abs/2303.05828)



# Part I. Basic imports

In [1]:
!wget -nc https://raw.githubusercontent.com/HHU-MMBS/RepresentationLearning_PUBLIC_2024/main/exercises/week08/utils.py

File ‘utils.py’ already there; not retrieving.



In [2]:
!pip install open_clip_torch timm



In [3]:
LOAD_DATA = True

In [4]:
import os
from pathlib import Path

import open_clip
import torch
import torch.nn.functional as F
import torchvision
import torchvision.transforms as T
from sklearn.cluster import KMeans
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from tqdm.auto import trange

models_dir = Path('./models/').resolve()
feats_dir = Path('./features/').resolve()
models_dir.mkdir(exist_ok=True)
feats_dir.mkdir(exist_ok=True)
# Local import
from utils import get_features, auroc_score, linear_eval, load_model

device = torch.device('cuda' if torch.cuda.is_available() else 'mps' if torch.backends.mps.is_available() else 'cpu')

# Part II. Get the visual features of the CLIP model

- We will use `CIFAR100` as the in-distribution, and `CIFAR10` as the out-distribution.
- When you are only loading the visual CLIP backbone, you must remove the final linear layer that projects the features to the shared feature space of the image-text encoder.
- Load the data, compute the visual features and save them in the `features` folder.
- For the in-distribution data you need both the train and test splits, while for the out-distribution data we will only use the test split.
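
Extracting CLIP features is the expensive step, so it helps to compute them once and reload them afterwards. The following is a minimal sketch of that compute-once pattern, not the required solution. `load_or_compute` and `fake_extract` are hypothetical names introduced for illustration, and NumPy's `.npy` files stand in for the `torch.save` / `torch.load` calls you would use on tensors.

```python
import tempfile
from pathlib import Path

import numpy as np


def load_or_compute(path, compute_fn):
    """Return the cached array at `path`; compute and save it on a cache miss."""
    path = Path(path)
    if path.exists():
        return np.load(path)
    feats = compute_fn()
    np.save(path, feats)
    return feats


calls = []


def fake_extract():
    # Hypothetical stand-in for the forward pass through the visual backbone.
    calls.append(1)
    return np.arange(12, dtype=np.float32).reshape(3, 4)


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "toy_feats.npy"
    first = load_or_compute(cache, fake_extract)   # computes and saves
    second = load_or_compute(cache, fake_extract)  # served from disk
```

The second call never touches `fake_extract`, which is the behaviour you want when re-running the notebook.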


### Optional structure

```python
def load_datasets(indist="CIFAR100", ood="CIFAR10", batch_size=256, transform=None):
    # ....
    return indist_train_loader, indist_test_loader, ood_loader

# visual is a boolean that controls whether the visual backbone is only returned or the whole CLIP model
def get_model(visual, name, pretrained):
    # .....
    if visual:
        # ....
        return backbone, preprocess
    return model, preprocess, tokenizer
    
# Load everything .......

feats, labels = get_features(backbone, dl, device)
# Save features 
# ....
```

In [5]:
### START CODE HERE ### (≈ 30 lines of code)
def load_data(batch_size=256, indist='CIFAR100', ood='CIFAR10', transform=T.ToTensor()):
    ds = lambda name, train: getattr(torchvision.datasets, name)(root='~/data', train=train, download=True, transform=transform)
    dl = lambda name, train: DataLoader(ds(name, train), batch_size=batch_size, shuffle=train)

    return dl(indist, train=True), dl(indist, train=False), dl(ood, train=False)


def get_model(visual=False, name='convnext_base_w', pretrained='laion2b_s13b_b82k'):
    model, _, preprocess = open_clip.create_model_and_transforms(name, pretrained, device=device)
    if not visual: return model, preprocess, open_clip.get_tokenizer(name)

    backbone = model.visual
    backbone.head = nn.Identity()
    return backbone, preprocess

In [6]:
model, _ = get_model(True)

In [7]:
filenames = ['cifar100_train', 'cifar100_test', 'cifar10_test']
backbone, preprocess = get_model(True)
dls = load_data(transform=preprocess)
for name, dl in zip(filenames, dls):
    if LOAD_DATA and os.path.exists(feats_dir / f'{name}_feats.pt'): continue

    feats, labels = get_features(backbone, dl, device)
    torch.save(feats, feats_dir / f'{name}_feats.pt')
    torch.save(labels, feats_dir / f'{name}_labels.pt')
### END CODE HERE ###

In [8]:
# feature test
for name, N in [('cifar100_train', 50000), ('cifar100_test', 10000), ('cifar10_test', 10000)]:
    feats = torch.load(feats_dir / f'{name}_feats.pt')
    labels = torch.load(feats_dir / f'{name}_labels.pt')
    assert feats.shape == (N, 1024)
    assert labels.shape == (N,)
print('Success!')

Success!


# Part III. Compute the k-NN similarity as the OOD score

- For each test image of the in- and out-distribution, compute the top-1 cosine similarity to the in-distribution training features and use it as the OOD score.
- Report the resulting AUROC score.
- Note: Use the image features and not the images!
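
To make the scoring rule concrete, here is a small NumPy sketch on toy 2-D vectors. It is an illustration under assumptions, not the reference solution: `knn_cosine_score` is a hypothetical helper, and averaging the top-k similarities for k > 1 is one possible choice (for k = 1 it reduces to the top-1 similarity asked for above).

```python
import numpy as np


def knn_cosine_score(train_feats, test_feats, k=1):
    """Mean cosine similarity of each test row to its k most similar training rows."""
    train = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    test = test_feats / np.linalg.norm(test_feats, axis=1, keepdims=True)
    sim = test @ train.T                  # (num_test, num_train) cosine similarities
    topk = np.sort(sim, axis=1)[:, -k:]   # k largest similarities per test row
    return topk.mean(axis=1)


train = np.array([[1.0, 0.0], [0.0, 2.0]])
test = np.array([[3.0, 0.0],    # same direction as the first training row
                 [1.0, 1.0]])   # 45 degrees away from both training rows
scores = knn_cosine_score(train, test)
```

A high score means the test feature has a close neighbour in the training set, so it is treated as more likely in-distribution.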

In [9]:
@torch.no_grad()
def OOD_classifier_knn(train_features, test_features, k=1, batch_size=1024):
    """
    Args:
        train_features: Tensor of shape NxD
        test_features: Tensor of shape KxD
        k: Number of closest train features to consider.
           Optional for this exercise

    Returns:
        cos_sim: Top-k cosine similarities, tensor of shape Kxk

    """
    ### START CODE HERE ### (≈ 13 lines of code)
    B = batch_size
    K, _ = test_features.shape

    train_features = F.normalize(train_features, dim=-1)
    test_features = F.normalize(test_features, dim=-1)

    cos_sim = torch.empty(K, k, device=test_features.device, dtype=test_features.dtype)
    for start in trange(0, K, B):
        feat = test_features[start:start + B]
        sim = feat @ train_features.T
        cos_sim[start:start + B, :] = sim.topk(k)[0]
    ### END CODE HERE ###
    return cos_sim


# load the computed features and compute scores
### START CODE HERE ### (≈ 5 lines of code)
cifar100_train_feats = torch.load(feats_dir / 'cifar100_train_feats.pt').to(device)
cifar100_test_feats = torch.load(feats_dir / 'cifar100_test_feats.pt').to(device)
cifar10_test_feats = torch.load(feats_dir / 'cifar10_test_feats.pt').to(device)
score_in = OOD_classifier_knn(cifar100_train_feats, cifar100_test_feats).cpu()
score_out = OOD_classifier_knn(cifar100_train_feats, cifar10_test_feats).cpu()
### END CODE HERE ###
print(f'CIFAR100-->CIFAR10 AUROC: {auroc_score(score_in, score_out):.2f}')

CIFAR100-->CIFAR10 AUROC: 83.55


### Expected result

```
CIFAR100-->CIFAR10 AUROC: 83.55
```
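
The AUROC above can be read as the probability that a randomly chosen in-distribution sample receives a higher score than a randomly chosen OOD sample. The sketch below illustrates that definition; it is not the implementation of `auroc_score` in `utils.py`, which is not shown here. `auroc_percent` is a hypothetical name, and scaling to percent is an assumption made to match the printed values. The pairwise comparison needs memory proportional to the product of both set sizes, so it is only meant for small inputs.

```python
import numpy as np


def auroc_percent(score_in, score_out):
    """AUROC in percent, treating in-distribution samples as the positive class."""
    score_in = np.asarray(score_in, dtype=float)
    score_out = np.asarray(score_out, dtype=float)
    # Compare every in-distribution score with every OOD score.
    diff = score_in[:, None] - score_out[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()  # ties count as half a win
    return 100.0 * wins / diff.size


perfect = auroc_percent([0.9, 0.8], [0.1, 0.2])  # every pair ordered correctly
chance = auroc_percent([0.5, 0.5], [0.5, 0.5])   # all ties
mixed = auroc_percent([0.9, 0.3], [0.5, 0.1])    # 3 of 4 pairs ordered correctly
```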


# Part IV. Compute MSP using the text encoder and the label names

We will now consider the case where the in-distribution label names are available.

Your task is to apply zero-shot classification and get the maximum softmax probability (MSP) as the OOD score.

In short:
- compute image and text embeddings
- compute the image-text similarity matrix (logits)
- apply softmax to the logits for each image to get a probability distribution of the classes.
- compute maximum softmax probability (MSP)

- `Note`: After loading the saved image features you need to apply the linear projection layer from the visual backbone of CLIP
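
The steps above can be sketched in NumPy on toy embeddings. This is a hedged illustration: `max_softmax_prob` is a hypothetical helper, and `scale=100.0` is a placeholder for the learned temperature, which in the actual model you would read from `model.logit_scale.exp()`.

```python
import numpy as np


def max_softmax_prob(img_emb, text_emb, scale=100.0):
    """Return (MSP, predicted class) for each image embedding."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = scale * img @ txt.T                  # image-text similarity matrix
    logits -= logits.max(axis=1, keepdims=True)   # subtract row max for stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)     # softmax over classes
    return probs.max(axis=1), probs.argmax(axis=1)


text_emb = np.eye(3)                     # three toy "class" embeddings
img_emb = np.array([[1.0, 0.0, 0.0],     # aligned with class 0
                    [1.0, 1.0, 1.0]])    # equally similar to all classes
msp, pred = max_softmax_prob(img_emb, text_emb)
```

The first image gets an MSP close to 1, while the second gets the uniform value 1/3, which is the kind of low-confidence prediction the MSP score flags as OOD.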

In [10]:
@torch.inference_mode()
def compute_msp(model, class_tokens, indist_test, ood_test, device):
    ### START CODE HERE ### (≈ 7 lines of code)
    scale = model.logit_scale.exp().item()

    def score(img_emb, text_emb):
        img_emb = F.normalize(img_emb, dim=-1)
        text_emb = F.normalize(text_emb, dim=-1)
        logits = (img_emb @ text_emb.T) * scale
        probs = logits.softmax(dim=-1)
        return probs.max(dim=-1).values.cpu()

    model.eval()
    model.to(device)

    class_tokens_emb = model.encode_text(class_tokens)

    indist_test = indist_test.to(device)
    ood_test = ood_test.to(device)

    score_in = score(indist_test, class_tokens_emb)
    score_out = score(ood_test, class_tokens_emb)
    ### END CODE HERE ###
    return score_in, score_out

In [11]:
# Load model and features
### START CODE HERE ### (≈ 4 lines of code)
model, _, tokenizer = get_model()
with torch.inference_mode():
    indist_train = model.visual.head(cifar100_train_feats.to(device))
    indist_test = model.visual.head(cifar100_test_feats.to(device))
    ood_test = model.visual.head(cifar10_test_feats.to(device))
### END CODE HERE ###

In [12]:
# model, tokenizer,indist_test, ood_test = ...... need to be defined above
### Provided
label_names = torchvision.datasets.CIFAR100(root='~/data', train=True, download=True).classes
prompts = ['an image of a ' + lab.replace('_', ' ') for lab in label_names]
class_tokens = tokenizer(label_names).to(device)
score_in, score_out = compute_msp(model, class_tokens, indist_test, ood_test, device)
print(f'CIFAR100-->CIFAR10 AUROC: {auroc_score(score_in, score_out):.2f}')

CIFAR100-->CIFAR10 AUROC: 76.38


### Expected result

```
CIFAR100-->CIFAR10 AUROC: 76.36
```


# Part V. Linear probing on the pseudolabels

- Your task is to train a linear layer on the CLIP embeddings using the CLIP pseudolabels as targets.
- The pseudolabels are the argmax of the logits computed above, i.e., take the class with the maximum probability as the class label.
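
As a rough illustration of the idea, the sketch below derives pseudolabels from stand-in zero-shot logits and fits a linear softmax classifier to them. It is a toy under assumptions: the data are synthetic 2-D clusters, the stand-in "text embeddings" are the cluster centres, and plain gradient descent replaces the Adam-based `linear_eval` helper used in the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "embeddings": three well-separated clusters in 2-D.
centers = np.array([[4.0, 0.0], [-4.0, 0.0], [0.0, 4.0]])
cluster = rng.integers(0, 3, size=300)
feats = centers[cluster] + 0.3 * rng.standard_normal((300, 2))

# Pseudolabels: argmax over the (stand-in) zero-shot logits.
pseudolabels = (feats @ centers.T).argmax(axis=1)

# Linear probe trained with gradient descent on the mean cross-entropy.
num_classes = 3
W = np.zeros((feats.shape[1], num_classes))
b = np.zeros(num_classes)
onehot = np.eye(num_classes)[pseudolabels]

for _ in range(200):
    logits = feats @ W + b
    logits -= logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    grad = (probs - onehot) / len(feats)   # gradient w.r.t. the logits
    W -= 0.1 * feats.T @ grad
    b -= 0.1 * grad.sum(axis=0)

probe_pred = (feats @ W + b).argmax(axis=1)
agreement = (probe_pred == pseudolabels).mean()
```

`agreement` measures how well the probe reproduces the pseudolabels; it says nothing about accuracy with respect to the real labels.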

In [13]:
@torch.inference_mode()
def compute_score_probe_msp(lin_layer, indist_dl, ood_dl, device):
    """
    Computes the MSP scores for a linear layer for both in- and out- distribution.
    """

    ### START CODE HERE ### (≈ 4 lines of code)
    # dataset small enough to run in single pass
    score = lambda dl: lin_layer(dl.dataset.tensors[0]).softmax(dim=-1).max(dim=-1).values.cpu()

    lin_layer.eval()
    lin_layer.to(device)
    score_in, score_out = score(indist_dl), score(ood_dl)
    ### END CODE HERE ###
    return score_in, score_out

In [14]:
### START CODE HERE ### (≈ 17 lines of code)
@torch.inference_mode()
def load_pseudolabel_data(batch_size=256):
    model, _, tokenizer = get_model()
    model.eval()
    model.to(device)

    scale = model.logit_scale.exp().item()

    class_tokens_enc = model.encode_text(class_tokens)
    text_emb = F.normalize(class_tokens_enc, dim=-1)

    def calc_pseudolabel(img_emb):
        img_emb = F.normalize(img_emb, dim=-1)
        logits = (img_emb @ text_emb.T) * scale
        probs = logits.softmax(dim=-1)
        return probs.max(dim=-1).indices.cpu()

    train_ds = TensorDataset(indist_train, calc_pseudolabel(indist_train))
    val_ds = TensorDataset(indist_test, calc_pseudolabel(indist_test))

    train_dl = DataLoader(train_ds, batch_size=batch_size, shuffle=True)
    val_dl = DataLoader(val_ds, batch_size=batch_size)

    return train_ds, val_ds, train_dl, val_dl


train_ds, val_ds, train_dl, val_dl = load_pseudolabel_data()
### END CODE HERE ###

In [15]:
# The code below is provided based on our implementation. Optional to use!
# Run linear probing
embed_dim = train_ds[0][0].size(0)
lin_layer = nn.Linear(embed_dim, 100).to(device)
optimizer = torch.optim.Adam(lin_layer.parameters(), lr=1e-3)
num_epochs = 20
dict_log = linear_eval(lin_layer, optimizer, num_epochs, train_dl, val_dl, device, prefix=models_dir / 'CLIP')
# compute MSP scores
lin_layer = load_model(models_dir / "CLIP_best_max_train_acc.pth", device)
ood_ds = TensorDataset(ood_test, torch.zeros(ood_test.shape[0], dtype=torch.long))
ood_dl = DataLoader(ood_ds, batch_size=128, shuffle=False, drop_last=False)
score_in, score_out = compute_score_probe_msp(lin_layer, val_dl, ood_dl, device)
print(f'CIFAR100-->CIFAR10 AUROC: {auroc_score(score_in, score_out):.2f}')

Model /Users/Mac/Documents/Uni/Representation Learning/assignment_07/models/CLIP_best_max_train_acc.pth is loaded from epoch 19
CIFAR100-->CIFAR10 AUROC: 70.14


### Expected results

AUROC may slightly vary due to random initialization of linear probing.
```
CIFAR100-->CIFAR10 AUROC: 74.81
```

# Part VI. Mahalanobis distance as OOD score

For definitions of the (relative) Mahalanobis distance, see [A Simple Fix to Mahalanobis Distance for Improving Near-OOD Detection](https://arxiv.org/abs/2106.09022)
- Use the output of the linear layer from Part V as features to compute the Mahalanobis distance and the relative Mahalanobis distance.
- To compute the Mahalanobis distance, group the features by their pseudolabels and compute the mean and covariance matrix for each class.
- Only consider classes with at least two samples.
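
A compact NumPy sketch of both scores follows. It is one possible reading of the task, not the definitive method: it fits a separate covariance per class (check the linked paper for its exact covariance definition), adds a small ridge term `eps` for invertibility as an assumption, and uses hypothetical helper names. The relative variant subtracts the distance to a single Gaussian fitted on all training features.

```python
import numpy as np


def fit_gaussians(feats, labels, eps=1e-5):
    """Per-class mean and inverse covariance; classes with < 2 samples are skipped."""
    ridge = eps * np.eye(feats.shape[1])
    stats = {}
    for c in np.unique(labels):
        x = feats[labels == c]
        if len(x) >= 2:
            stats[c] = (x.mean(axis=0), np.linalg.inv(np.cov(x, rowvar=False) + ridge))
    return stats


def sq_maha(x, mu, inv_cov):
    """Squared Mahalanobis distance of each row of x to N(mu, cov)."""
    d = x - mu
    return np.einsum("bi,ij,bj->b", d, inv_cov, d)


def maha_score(train, labels, test, relative=False, eps=1e-5):
    stats = fit_gaussians(train, labels, eps)
    dist = np.min([sq_maha(test, mu, ic) for mu, ic in stats.values()], axis=0)
    if relative:
        ridge = eps * np.eye(train.shape[1])
        g_mu = train.mean(axis=0)
        g_ic = np.linalg.inv(np.cov(train, rowvar=False) + ridge)
        dist = dist - sq_maha(test, g_mu, g_ic)
    return -dist  # negated so that a higher score means "more in-distribution"


rng = np.random.default_rng(0)
train = np.concatenate([rng.standard_normal((200, 2)) + [5, 0],
                        rng.standard_normal((200, 2)) + [-5, 0]])
labels = np.repeat([0, 1], 200)
test = np.array([[5.0, 0.0],   # at the centre of class 0
                 [0.0, 0.0]])  # between the two classes
md = maha_score(train, labels, test)
rmd = maha_score(train, labels, test, relative=True)
```

In both variants the point at the class centre scores higher than the point between the classes.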

In [16]:
torch.Tensor.cov

<method 'cov' of 'torch._C.TensorBase' objects>

In [17]:
@torch.inference_mode()
def OOD_classifier_maha(train_embeds_in, train_labels_in, test_embeds_in, test_embeds_outs, num_classes,
    relative=False):
    ### START CODE HERE ### (≈ 23 lines of code)
    labels = train_labels_in.unique().tolist()

    # Small ridge term keeps the covariance matrices invertible. Unverified: it raised all
    # AUROCs noticeably, but without it the AUROC sometimes drops to an implausible 20-30.
    cov_stabilizer = 1e-5 * torch.eye(train_embeds_in.shape[1], device=train_embeds_in.device)

    x = {c: train_embeds_in[train_labels_in == c] for c in labels if (train_labels_in == c).sum() >= 2}

    mu = {c: v.mean(dim=0) for c, v in x.items()}
    inv_cov = {c: (v.T.cov() + cov_stabilizer).inverse() for c, v in x.items()}

    if relative:
        global_mu = train_embeds_in.mean(dim=0)
        global_inv_cov = (train_embeds_in.T.cov() + cov_stabilizer).inverse()

    def maha(x, mu_c, inv_cov_c):
        centered = x - mu_c
        return torch.einsum('bi,ij,bj->b', centered, inv_cov_c, centered)

    def score(x_test):
        dist = torch.stack([maha(x_test, mu[c], inv_cov[c]) for c in x]).min(dim=0).values
        return dist - maha(x_test, global_mu, global_inv_cov) if relative else dist

    scores_in = -score(test_embeds_in)
    scores_out = -score(test_embeds_outs)
    ### END CODE HERE ###
    return scores_in, scores_out


# The code below is provided based on our implementation. Optional to use!
num_classes = 100
lin_layer = load_model(models_dir / "CLIP_best_max_train_acc.pth", device)
logits_indist_train, indist_pseudolabels_train = get_features(lin_layer, train_dl, device)
logits_indist_test, indist_pseudolabels_test = get_features(lin_layer, val_dl, device)
logits_ood, _ = get_features(lin_layer, ood_dl, device)

# run OOD classifier based on mahalanobis distance
scores_in, scores_out = OOD_classifier_maha(logits_indist_train, indist_pseudolabels_train,
                                            logits_indist_test, logits_ood, num_classes, relative=False)
print(f'Maha: CIFAR100-->CIFAR10 AUROC: {auroc_score(scores_in, scores_out):.2f}')
scores_in, scores_out = OOD_classifier_maha(logits_indist_train, indist_pseudolabels_train,
                                            logits_indist_test, logits_ood, num_classes, relative=True)
print(f'Relative Maha: CIFAR100-->CIFAR10 AUROC: {auroc_score(scores_in, scores_out):.2f}')

Model /Users/Mac/Documents/Uni/Representation Learning/assignment_07/models/CLIP_best_max_train_acc.pth is loaded from epoch 19


Maha: CIFAR100-->CIFAR10 AUROC: 81.80
Relative Maha: CIFAR100-->CIFAR10 AUROC: 83.22


### Expected results
(can differ based on linear probing performance)

```
Maha: CIFAR100-->CIFAR10 AUROC: 84.31
Relative Maha: CIFAR100-->CIFAR10 AUROC: 88.20
```

# Part VII. Mahalanobis distance using the real labels without linear probing
- Again, compute the (relative) Mahalanobis distance as OOD score
- This time, instead of using the pseudolabels and the output of the linear probing layer, use the real labels of the training data and the features computed in Part II.

In [18]:
### START CODE HERE ### (≈ 7 lines of code)
indist_labels = torch.load(feats_dir / 'cifar100_train_labels.pt')
scores_md_in, scores_md_out = OOD_classifier_maha(indist_train, indist_labels, indist_test, ood_test, num_classes, relative=False)
scores_rmd_in, scores_rmd_out = OOD_classifier_maha(indist_train, indist_labels, indist_test, ood_test, num_classes, relative=True)
### END CODE HERE ###
print(f'Maha: CIFAR100-->CIFAR10 AUROC: {auroc_score(scores_md_in, scores_md_out):.2f}')
print(f'Relative Maha: CIFAR100-->CIFAR10 AUROC: {auroc_score(scores_rmd_in, scores_rmd_out):.2f}')

Maha: CIFAR100-->CIFAR10 AUROC: 78.36
Relative Maha: CIFAR100-->CIFAR10 AUROC: 78.50


### Expected results
```
Maha: CIFAR100-->CIFAR10 AUROC: 69.43
Relative Maha: CIFAR100-->CIFAR10 AUROC: 64.32
```

# Part VIII. K-means clusters combined with Mahalanobis distance

The paper [SSD: A Unified Framework for Self-Supervised Outlier Detection](https://arxiv.org/abs/2103.12051) proposes another unsupervised method for OOD detection. Instead of using the (real or pseudo) labels to define the class-wise means, we will now use the clusters found by k-means. In more detail:

- Find k=10,50,100 clusters using k-means on the in-distribution training data (you can use the sklearn KMeans implementation).
- Get the cluster centers.
- Use them as class-wise means for the Mahalanobis distance classifier.
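
The clustering step can be sketched as follows. This is a dependency-light stand-in written in NumPy, not the sklearn `KMeans` used in the notebook: `kmeans_labels` is a hypothetical helper that runs plain Lloyd iterations, and the deterministic farthest-point initialisation is an assumption chosen to keep the toy reproducible.

```python
import numpy as np


def kmeans_labels(x, k, iters=20):
    """Lloyd's algorithm with deterministic farthest-point initialisation."""
    centers = [x[0]]
    for _ in range(k - 1):
        # Next centre: the point farthest from all centres chosen so far.
        d = np.min([((x - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(x[d.argmax()])
    centers = np.stack(centers)
    for _ in range(iters):
        dists = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():   # keep the old centre if a cluster empties
                centers[j] = x[labels == j].mean(axis=0)
    return labels, centers


rng = np.random.default_rng(0)
blob_a = rng.standard_normal((50, 2)) * 0.2 + [3, 3]
blob_b = rng.standard_normal((50, 2)) * 0.2 + [-3, -3]
x = np.concatenate([blob_a, blob_b])
labels, centers = kmeans_labels(x, k=2)
```

The returned `labels` play the role of class labels: passing them to a Mahalanobis classifier in place of the real or pseudo labels gives one mean and covariance per cluster.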

In [19]:
# The code below is provided based on our implementation. Optional to use!
# load features - modify names if you use different names
indist_train = torch.load('features/cifar100_train_feats.pt')
indist_test = torch.load('features/cifar100_test_feats.pt')
ood_test = torch.load('features/cifar10_test_feats.pt')
results_md = []
results_rmd = []
for N in [10, 50, 100]:
    ### START CODE HERE ### (≈ 7 lines of code)
    kmeans = KMeans(n_clusters=N, random_state=42).fit(indist_train.cpu().numpy())
    train_cluster_labels = torch.from_numpy(kmeans.labels_).to(indist_train.device)
    scores_md_in, scores_md_out = OOD_classifier_maha(indist_train, train_cluster_labels, indist_test, ood_test, num_classes, relative=False)
    scores_rmd_in, scores_rmd_out = OOD_classifier_maha(indist_train, train_cluster_labels, indist_test, ood_test, num_classes, relative=True)
    auroc_md = auroc_score(scores_md_in, scores_md_out)
    auroc_rmd = auroc_score(scores_rmd_in, scores_rmd_out)
    ### END CODE HERE ###
    print(f'Kmeans (k={N}) + MD: CIFAR100-->CIFAR10 AUROC: {auroc_md:.2f}')
    print(f'Kmeans (k={N}) + RMD: CIFAR100-->CIFAR10 AUROC: {auroc_rmd:.2f}')
    print("-" * 100)

Kmeans (k=10) + MD: CIFAR100-->CIFAR10 AUROC: 76.44
Kmeans (k=10) + RMD: CIFAR100-->CIFAR10 AUROC: 71.02
----------------------------------------------------------------------------------------------------
Kmeans (k=50) + MD: CIFAR100-->CIFAR10 AUROC: 59.42
Kmeans (k=50) + RMD: CIFAR100-->CIFAR10 AUROC: 56.48
----------------------------------------------------------------------------------------------------
Kmeans (k=100) + MD: CIFAR100-->CIFAR10 AUROC: 47.35
Kmeans (k=100) + RMD: CIFAR100-->CIFAR10 AUROC: 46.53
----------------------------------------------------------------------------------------------------


### Expected results
Can differ based on KMeans performance.

```
Kmeans (k=10) + MD: CIFAR100-->CIFAR10 AUROC: 68.82
Kmeans (k=10) + RMD: CIFAR100-->CIFAR10 AUROC: 50.41
----------------------------------------------------------------------------------------------------
Kmeans (k=50) + MD: CIFAR100-->CIFAR10 AUROC: 68.18
Kmeans (k=50) + RMD: CIFAR100-->CIFAR10 AUROC: 46.16
----------------------------------------------------------------------------------------------------
Kmeans (k=100) + MD: CIFAR100-->CIFAR10 AUROC: 67.02
Kmeans (k=100) + RMD: CIFAR100-->CIFAR10 AUROC: 46.65
----------------------------------------------------------------------------------------------------
```