# Detecting Outliers with Cleanlab and PyTorch Image Models (timm)

This 5-minute quickstart tutorial shows how to detect potential outliers in image classification data using Cleanlab and PyTorch. The dataset used is `cifar10`, which contains 60,000 images. Each image belongs to 1 of 10 categories: `[airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck]`.

**Overview of what we'll do in this tutorial:**

- Load the [cifar10](https://www.cs.toronto.edu/~kriz/cifar.html) dataset and do some basic data pre-processing.
- Create `trainset` and `testset` such that `trainset` only contains animals and `testset` contains all categories.
- Load a pretrained `timm` model and extract `trainset` and `testset` feature embeddings.
- Use `cleanlab` to find naturally occurring outlier examples in the `trainset`.
- Use `cleanlab` to find outlier examples (non-animals) in the `testset`.
- Explore threshold selection for labeling outliers.

## 1. Install the required dependencies
You can use `pip` to install all packages required for this tutorial as follows:

```ipython3
!pip install torch
!pip install cleanlab
...
# Make sure to install the version corresponding to this tutorial
# E.g. if viewing master branch documentation:
#     !pip install git+https://github.com/cleanlab/cleanlab.git
```

In [None]:
# Package installation (hidden on docs website).
# If running on Colab, may want to use GPU (select: Runtime > Change runtime type > Hardware accelerator > GPU)
# Package versions we used: matplotlib==3.5.1, numpy==1.21.6, torch==1.11.0, scikit-learn==1.0.2, torchvision==0.12.0, timm==0.5.4, cleanlab==2.0.0

dependencies = ['builtins','torch','torchvision','torchvision.transforms','numpy','matplotlib.pyplot','warnings','cleanlab','timm']

if "google.colab" in str(get_ipython()):  # Check if it's running in Google Colab
    %pip install cleanlab  # for colab
    cmd = ' '.join([dep for dep in dependencies if dep != "cleanlab"])
    %pip install $cmd
else:
    missing_dependencies = []
    for dependency in dependencies:
        try:
            __import__(dependency)
        except ImportError:
            missing_dependencies.append(dependency)

    if len(missing_dependencies) > 0:
        print("Missing required dependencies:")
        print(*missing_dependencies, sep=", ")
        print("\nPlease install them before running the rest of this notebook.")

Let's first import the required packages and set some seeds for reproducibility.

In [None]:
import torch
import torchvision
import torchvision.transforms as transforms

import numpy as np
import matplotlib.pyplot as plt
from pylab import rcParams

import warnings

import cleanlab
from cleanlab.rank import get_outlier_scores
from sklearn.metrics import precision_recall_curve
from sklearn.neighbors import NearestNeighbors # import KNN estimator
import timm # resnet50 pre-trained model

SEED = 42
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
torch.cuda.manual_seed_all(SEED)
warnings.filterwarnings("ignore", "Lazy modules are a new feature.*")

## 2. Fetch and scale the Cifar10 dataset

Import the `cifar10` dataset. After some basic preprocessing, we manually remove some categories from the training examples, thereby making them outliers in the test set. For this example, we've chosen to remove all non-animal categories `[airplane, automobile, ship, truck]` from the training set `trainset`.

In [None]:
# Select how to load the cifar10 datasets. Load into tensors for training.
transform_normalize = transforms.Compose(
    [transforms.ToTensor(),
    ])

# Load cifar10 datasets
trainset = torchvision.datasets.CIFAR10(root='./data', train=True,
                                        download=True, transform=transform_normalize)
testset = torchvision.datasets.CIFAR10(root='./data', train=False,
                                       download=True, transform=transform_normalize)

# Manually remove non-animals out of the training dataset
animal_labels = [2,3,4,5,6,7]
trainy = trainset.targets # Get labels
animal_idxs = np.where(np.isin(trainy, animal_labels))[0]
trainset  = torch.utils.data.Subset(trainset, animal_idxs) # Select subset of animals for our trainset

# Check the shapes of our training and test sets
print('Trainset length: %s' % (len(trainset)))
print('Testset length: %s' % (len(testset)))
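
The index selection above can be checked on a handful of labels. The following is a minimal sketch on made-up targets (`toy_targets` is a hypothetical stand-in for `trainset.targets`), showing how `np.isin` and `np.where` combine to give the positions of the animal classes. Since the CIFAR-10 training split has 5,000 images per class, keeping 6 of the 10 classes should leave 30,000 training images, while the test set keeps all 10,000.

```python
import numpy as np

# Toy labels standing in for `trainset.targets` (values are CIFAR-10 class ids).
toy_targets = [0, 3, 9, 5, 2, 1, 7, 8]
animal_labels = [2, 3, 4, 5, 6, 7]

# np.isin builds a boolean mask; np.where turns the mask into integer positions.
animal_mask = np.isin(toy_targets, animal_labels)
toy_animal_idxs = np.where(animal_mask)[0]

print(animal_mask.tolist())
print(toy_animal_idxs.tolist())  # positions of the animal examples: [1, 3, 4, 6]
```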

#### Let's visualize some of the training and test examples

In [None]:
txt_labels = {0: 'airplane', 
              1: 'automobile', 
              2: 'bird',
              3: 'cat', 
              4: 'deer', 
              5: 'dog', 
              6: 'frog', 
              7: 'horse', 
              8:'ship', 
              9:'truck'}

def imshow(img):
#     img = img / 2 + 0.5     # unnormalize
    npimg = img.numpy()
    return np.transpose(npimg, (1, 2, 0))

def plot_images(dataset):
    plt.rcParams["figure.figsize"] = (9,7)
    for i in range(15):
        X,y = dataset[i]
        ax = plt.subplot(3,5,i+1)
        ax.set_title(txt_labels[int(y)])
        ax.imshow(imshow(X))
    plt.show()
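
The transpose in `imshow` is needed because `ToTensor()` returns images in channels-first `(C, H, W)` layout, whereas matplotlib expects channels-last `(H, W, C)`. As a small illustration on a synthetic array (no real image data is used here):

```python
import numpy as np

# A fake 3-channel 32x32 image in channels-first (C, H, W) layout.
chw_image = np.zeros((3, 32, 32), dtype=np.float32)
chw_image[0] = 1.0  # fill only the first (red) channel

# Move the channel axis last so matplotlib can interpret each pixel as RGB.
hwc_image = np.transpose(chw_image, (1, 2, 0))

print(chw_image.shape, "->", hwc_image.shape)
print(hwc_image[0, 0])  # the pixel at row 0, column 0 is pure red
```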

Observe how there are only animals left in the training set `trainset` below.

In [None]:
plot_images(trainset)

The test set, on the other hand, still visibly contains the non-animal images: `[ship, airplane, automobile, truck]`. If we consider `trainset` to be representative of the normal data distribution, then these non-animal images in `testset` become outliers.

In [None]:
plot_images(testset)

## 3. Import a model and get embeddings
We pass in the images into the model to generate embeddings in the feature space that we require as inputs for the outlier detection algorithm. 

The model we are importing is a `resnet50` from [timm](https://timm.fast.ai/), a deep-learning library that collects SOTA models and utilities. Outlier detection can, however, be done with any method capable of generating feature embeddings.

In [None]:
# Download the model from timm and set to eval mode
model = timm.create_model('resnet50', pretrained=True, num_classes=0)
model.eval()

In [None]:
# Create dataloaders for more efficient data streaming to the model
batch_size = 50
trainloader = torch.utils.data.DataLoader(trainset, batch_size=batch_size,
                                          shuffle=False, num_workers=2)

testloader = torch.utils.data.DataLoader(testset, batch_size=batch_size,
                                         shuffle=False, num_workers=2)

In [None]:
# Generate feature embeddings of the training data using the model
# This cell can take several minutes
train_feature_embeddings = []

for data in trainloader:
    images, labels = data
    with torch.no_grad():
        feature_embeddings = model(images)
        train_feature_embeddings.extend(feature_embeddings.numpy())
train_feature_embeddings = np.array(train_feature_embeddings)

print(f'Train embeddings pooled shape: {train_feature_embeddings.shape}')

In [None]:
# Generate feature embeddings of the test data using the model
# This cell can take several minutes
test_feature_embeddings = []

for data in testloader:
    images, labels = data
    with torch.no_grad():
        feature_embeddings = model(images)
        test_feature_embeddings.extend(feature_embeddings.numpy())
test_feature_embeddings = np.array(test_feature_embeddings)
print(f'Test embeddings pooled shape: {test_feature_embeddings.shape}')
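
The batching pattern in the two cells above can be sketched without downloading a model. In the snippet below, `toy_backbone` is a hypothetical stand-in (a plain global average pool), not the `timm` network. The point is only that processing batches in order and concatenating the results gives one embedding row per example, in dataset order. That ordering is why the loaders use `shuffle=False`: later cells index back into `trainset` and `testset` by row position. With `num_classes=0`, `timm` returns pooled features, so for `resnet50` the printed shapes should have 2048 columns.

```python
import numpy as np

def toy_backbone(batch):
    """Hypothetical stand-in for a headless CNN: pool (N, C, H, W) down to (N, C)."""
    return batch.mean(axis=(2, 3))

rng = np.random.default_rng(0)
toy_images = rng.random((120, 8, 4, 4))  # 120 fake inputs with 8 channels each
batch_size = 50

# Process the data in order, batch by batch, then stitch the outputs together.
chunks = []
for start in range(0, len(toy_images), batch_size):
    chunks.append(toy_backbone(toy_images[start:start + batch_size]))
toy_embeddings = np.concatenate(chunks, axis=0)

print(toy_embeddings.shape)  # one row per input, in the original order
```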

## 4. Use cleanlab to find outliers in the dataset
With just the feature embeddings and ``cleanlab`` we can now identify outliers in the data. 

### Finding naturally occurring outlier examples in the trainset

Calling ``get_outlier_scores()`` on ``train_feature_embeddings`` will find any naturally occurring outliers in ``trainset``. These examples should be animal images that look strange or different from the majority of other animal images in the data. It is not often you come across a cat wearing a turquoise beanie!

In [None]:
# Get outlier scores for our train feature embeddings
train_outlier_scores = get_outlier_scores(features=train_feature_embeddings)

# Visualize the 15 most likely outliers (lowest outlier scores)
top_train_outlier_idxs = (train_outlier_scores).argsort()[:15]
top_train_outlier_subset = torch.utils.data.Subset(trainset, top_train_outlier_idxs)
plot_images(top_train_outlier_subset)
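
To build intuition for what a KNN-based outlier score measures, here is a minimal numpy sketch. It is an illustration under stated assumptions, not cleanlab's implementation: `toy_knn_outlier_scores` is a hypothetical helper, and the `exp(-distance)` mapping is chosen only so that lower scores mean more atypical, matching the ranking convention used in this tutorial.

```python
import numpy as np

def toy_knn_outlier_scores(features, k=2):
    """Illustrative score: exp(-mean cosine distance to the k nearest other points)."""
    unit = features / np.linalg.norm(features, axis=1, keepdims=True)
    cosine_dist = 1.0 - unit @ unit.T          # pairwise cosine distances
    np.fill_diagonal(cosine_dist, np.inf)      # a point is not its own neighbor
    nearest = np.sort(cosine_dist, axis=1)[:, :k]
    return np.exp(-nearest.mean(axis=1))       # lower score = more atypical

toy_features = np.array([
    [1.0, 0.0], [0.9, 0.1], [1.0, 0.1], [0.95, 0.05],  # tight cluster
    [-0.2, 1.0],                                        # points in another direction
])
toy_scores = toy_knn_outlier_scores(toy_features)

print(toy_scores.round(3))
print(int(toy_scores.argmin()))  # index of the most atypical point
```

The clustered points have near-zero distance to their neighbors and score close to 1, while the last point is far from everything and receives the lowest score.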

Just for fun, let's see what our model considers the least likely outliers in the training set! These examples look very homogeneous because they lie very close to many of their neighbors in the feature space.

In [None]:
# Visualize 15 least probable outlier scores on trainset
bottom_train_outlier_idxs = (-train_outlier_scores).argsort()[:15]
bottom_train_outlier_subset = torch.utils.data.Subset(trainset, bottom_train_outlier_idxs)
plot_images(bottom_train_outlier_subset)
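
Both rankings rely on the same `argsort` idiom. A tiny sketch on made-up scores shows how sorting the scores gives the most atypical examples first, and sorting the negated scores gives the most typical ones first:

```python
import numpy as np

toy_scores = np.array([0.91, 0.12, 0.75, 0.40, 0.98])

most_atypical = toy_scores.argsort()[:2]       # smallest scores first
least_atypical = (-toy_scores).argsort()[:2]   # largest scores first

print(most_atypical.tolist())   # [1, 3]
print(least_atypical.tolist())  # [4, 0]
```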

### Finding outlier examples in the testset
We can also use ``get_outlier_scores()`` to find the held-out classes `[airplane, automobile, ship, truck]` that appear only in the test dataset.

We will begin by creating an ``sklearn.neighbors.NearestNeighbors`` estimator and fitting it on the ``train_feature_embeddings`` (i.e. the train dataset), so that test examples are scored relative to the training distribution. The function can also be called without passing in a ``knn`` object, in which case one will be created internally.

In [None]:
# Create a KNN estimator and fit it on the train feature embeddings
knn = NearestNeighbors(n_neighbors=20,metric="cosine").fit(train_feature_embeddings)

# Get outlier scores for the test feature embeddings
test_outlier_scores = get_outlier_scores(features=test_feature_embeddings, knn=knn)

<div class="alert alert-info">
Not sure which distance metric to use?
    

By default, `sklearn.neighbors.NearestNeighbors` uses `minkowski` distance, but we generally recommend using `cosine` distance instead when computing distances between neural net representations of data. Internally, `get_outlier_scores()` already uses sklearn's KNN based on `cosine` distance.
</div>
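
The difference between the two metrics is easy to see on toy vectors. The sketch below uses plain numpy (the helper names are illustrative) and compares cosine distance with Euclidean distance, which is what sklearn's default `minkowski` metric with `p=2` computes. Cosine distance depends only on direction, so it treats a rescaled copy of a vector as identical, whereas Euclidean distance is dominated by the change in magnitude.

```python
import numpy as np

def cosine_distance(u, v):
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def euclidean_distance(u, v):
    return np.linalg.norm(u - v)

a = np.array([1.0, 2.0, 3.0])
b = 10.0 * a                     # same direction as a, 10x the magnitude
c = np.array([3.0, 2.0, 1.0])    # same magnitude as a, different direction

print(cosine_distance(a, b), euclidean_distance(a, b))
print(cosine_distance(a, c), euclidean_distance(a, c))
```

Under cosine distance `b` is the closer neighbor of `a`; under Euclidean distance `c` is.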

In [None]:
# Visualize the 15 most likely outliers (lowest outlier scores)
top_outlier_idxs = (test_outlier_scores).argsort()[:15]
top_outlier_subset = torch.utils.data.Subset(testset, top_outlier_idxs)
plot_images(top_outlier_subset)

Notice how all of the outliers shown for `testset` belong to the held-out classes `[airplane, automobile, ship, truck]`. Their feature representations lie further from the training data in feature space than those of the animal images in `testset`.

Just for fun, let's visualize what the `NearestNeighbors` algorithm considers the 15 least probable outliers in our test set.

In [None]:
# Visualize the 15 least likely outliers (highest outlier scores)
bottom_outlier_idxs = (-test_outlier_scores).argsort()[:15]
bottom_outlier_subset = torch.utils.data.Subset(testset, bottom_outlier_idxs)
plot_images(bottom_outlier_subset)

Notice that there are far fewer images from the out-of-distribution classes here, and all the images look visually similar to each other. Even the ``automobile`` and ``airplane`` examples shown look similar to their animal counterparts.

We can also compute the precision/recall curve of our algorithm on the test set.

In [None]:
animal_labels = [2,3,4,5,6,7] # Animal labels to identify in the dataset
animal_idxs = np.where(np.isin(testset.targets, animal_labels))[0] # Find animal idxs
not_outlier = np.zeros(len(testset.targets), dtype=bool)
not_outlier[animal_idxs] = True
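# Animals are the positive class here; higher outlier scores mean more typical, so pass the scores directly
precision, recall, thresholds = precision_recall_curve(not_outlier, test_outlier_scores)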

In [None]:
plt.plot(recall, precision)
plt.xlabel("Recall", fontsize=14)
plt.ylabel("Precision", fontsize=14)
plt.show()
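
Each point on a precision/recall curve corresponds to one score threshold. As a rough sketch of what is being computed, the toy example below (made-up labels and scores, with inliers as the positive class and higher scores meaning more typical) evaluates precision and recall by hand at two thresholds:

```python
import numpy as np

# Toy ground truth: True = inlier (the positive class), False = outlier.
is_inlier = np.array([True, True, True, False, True, False])
# Toy scores where higher means "more typical".
toy_scores = np.array([0.9, 0.8, 0.4, 0.5, 0.7, 0.1])

def precision_recall_at(threshold):
    predicted_inlier = toy_scores >= threshold
    true_pos = np.sum(predicted_inlier & is_inlier)
    precision = true_pos / predicted_inlier.sum()
    recall = true_pos / is_inlier.sum()
    return float(precision), float(recall)

print(precision_recall_at(0.3))  # (0.8, 1.0): every inlier found, one outlier let through
print(precision_recall_at(0.6))  # (1.0, 0.75): no outliers let through, one inlier missed
```

Raising the threshold trades recall for precision, which is the trade-off the plotted curve summarizes.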

## 5. Thresholding outliers in outlier_scores

Now that we know how to find the outlier scores, how do we determine how many of the lowest-ranked examples in ``testset`` should be marked as outliers? We can use the `train_outlier_scores` distribution to calculate a threshold for the `testset`.

Suppose we want a hard threshold for outlier detection on future test data that yields around 5% false positives. We can look at the distribution of outlier scores for the `trainset` (assuming it has no outliers) and use the 5th percentile of this distribution as the threshold below which a test example is called an outlier.

Let's first take a look at our score distributions and see where the 5th percentile falls (along the red line).

In [None]:
# Calculate 5th percentile of the trainset distribution
fifth_percentile = np.percentile(train_outlier_scores, 5)

# Plot outlier_score distributions and the 5th percentile cutoff
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(12, 5))
plt_range = [min(train_outlier_scores.min(),test_outlier_scores.min()), \
             max(train_outlier_scores.max(),test_outlier_scores.max())]
axes[0].hist(train_outlier_scores, range=plt_range, bins=50)
axes[0].set(title='train_outlier_scores distribution', ylabel='Frequency')
axes[0].axvline(x=fifth_percentile, color='red', linewidth=2)
axes[1].hist(test_outlier_scores, range=plt_range, bins=50)
axes[1].set(title='test_outlier_scores distribution', ylabel='Frequency')
axes[1].axvline(x=fifth_percentile, color='red', linewidth=2)

plt.show()
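
The percentile rule can be tried on synthetic scores before applying it to real data. In this sketch all numbers are made up: the "train" scores are drawn only from an in-distribution range, while the "test" scores mix in a block of low-scoring outliers. The threshold is the 5th percentile of the train scores, so roughly 5% of in-distribution test points should fall below it.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic scores: "train" is all in-distribution; "test" adds low-scoring outliers.
toy_train_scores = rng.uniform(0.5, 1.0, size=1000)
toy_test_scores = np.concatenate([rng.uniform(0.5, 1.0, size=600),   # in-distribution
                                  rng.uniform(0.0, 0.5, size=400)])  # outliers

threshold = np.percentile(toy_train_scores, 5)
flagged = toy_test_scores < threshold

print(f"threshold: {threshold:.3f}")
print(f"in-distribution test points flagged: {flagged[:600].mean():.1%}")
print(f"outlier test points flagged: {flagged[600:].mean():.1%}")
```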

Everything to the left of the red line in the distribution of `test_outlier_scores` will be marked as an outlier. Let's measure how well this works. 

**Visually check the threshold**

First, let's plot the least certain outliers in our `testset`. These are the images immediately to the left of the cutoff line. The majority of them are still true outliers, but a few less standard-looking animals are now falsely identified as outliers as well.

In [None]:
# Visualize 15 outlier scores right along the cutoff (i.e. the least likely outliers given the threshold)
outlier_idxs = test_outlier_scores.argsort()
outlier_scores = test_outlier_scores[outlier_idxs]
selected_outlier_idxs = outlier_idxs[outlier_scores < fifth_percentile]

selected_outlier_subset = torch.utils.data.Subset(testset, selected_outlier_idxs[::-1])
plot_images(selected_outlier_subset)

**Empirically measure threshold effectiveness**

Setting the hard threshold to the 5th percentile of the `trainset` scores gives us almost exactly a 5% false positive rate. If that is what we are looking for, this is an effective threshold cutoff for this data distribution.

In [None]:
animal_labels = [2,3,4,5,6,7] # Animal labels in the dataset
animal_idxs = np.where(np.isin(testset.targets, animal_labels))[0]

false_positive_idxs = set(selected_outlier_idxs).intersection(set(animal_idxs))
# FPR = FP / (FP + TN); every test animal is either a false positive or a true negative
FPR = len(false_positive_idxs) / len(animal_idxs)
print(f'Number of false positives detected: {len(false_positive_idxs)}\nFalse positive rate: {round(FPR,4)} ')
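
For reference, the false positive rate is defined as FP / (FP + TN), where the negatives are the examples that are truly not outliers (here, the animals). A small worked example on made-up predictions shows that the denominator equals the total number of animals:

```python
import numpy as np

# Toy ground truth and predictions for 10 test examples.
is_animal = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0], dtype=bool)  # negatives = animals
flagged = np.array([0, 1, 0, 0, 0, 0, 1, 1, 0, 1], dtype=bool)    # predicted outliers

false_positives = np.sum(flagged & is_animal)    # animals wrongly flagged
true_negatives = np.sum(~flagged & is_animal)    # animals correctly left alone
toy_fpr = false_positives / (false_positives + true_negatives)

print(int(false_positives), int(true_negatives), round(float(toy_fpr), 4))  # 1 5 0.1667
```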

In [None]:
# Note: This cell is only for docs.cleanlab.ai, if running on local Jupyter or Colab, please ignore it.

top_outlier_idxs_test = [1579, 4373, 9898, 3485, 2701, 3768, 7033, 8413, 7217, 8311, 5814, 3789, 3740, 2059, 5056]
top_outlier_idxs_train = [26999, 20198,7967,29061,16684,2558,23072,454,5815,9967,27499,22507,8488,20685,9880]

if not all(x in top_outlier_idxs for x in top_outlier_idxs_test):
    raise Exception("Some highlighted examples are missing from top outliers in test set.")

if not all(x in top_train_outlier_idxs for x in top_outlier_idxs_train):
    raise Exception("Some highlighted examples are missing from top outliers in train set.")