# Detecting Outliers with Cleanlab and PyTorch Image Models (timm)

This 5-minute quickstart tutorial shows how to detect potential outliers in image classification data using Cleanlab and PyTorch. The dataset used is `cifar10`, which contains 60,000 images, each belonging to 1 of 10 categories: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck.

**Overview of what we'll do in this tutorial:**

- Load the [cifar10](https://www.cs.toronto.edu/~kriz/cifar.html) dataset and do some basic data pre-processing.
- Create `trainset` and `testset` such that `testset` contains extra categories.
- Load a pretrained model and extract feature embeddings of `trainset` and `testset`.
- Compute outlier scores for each example using cleanlab's `get_outlier_scores` function and analyze the results.

## 1. Install the required dependencies
You can use `pip` to install all packages required for this tutorial as follows:

```ipython3
!pip install torch
!pip install cleanlab
...
# Make sure to install the version corresponding to this tutorial
# E.g. if viewing master branch documentation:
#     !pip install git+https://github.com/cleanlab/cleanlab.git
```

In [None]:
# Package installation (hidden on docs website).
# If running on Colab, may want to use GPU (select: Runtime > Change runtime type > Hardware accelerator > GPU)
# Package versions we used: matplotlib==3.5.1, numpy==1.21.6, torch==1.11.0, scikit-learn==1.0.2, torchvision==0.12.0, timm==0.5.4, cleanlab==2.0.0

# Names below are valid both as importable modules and as pip package names
# (scikit-learn is installed as a dependency of cleanlab)
dependencies = ['cleanlab', 'torch', 'torchvision', 'numpy', 'matplotlib', 'timm']

if "google.colab" in str(get_ipython()):  # Check if it's running in Google Colab
    %pip install cleanlab  # for colab
    cmd = ' '.join([dep for dep in dependencies if dep != "cleanlab"])
    %pip install $cmd
else:
    missing_dependencies = []
    for dependency in dependencies:
        try:
            __import__(dependency)
        except ImportError:
            missing_dependencies.append(dependency)

    if len(missing_dependencies) > 0:
        print("Missing required dependencies:")
        print(*missing_dependencies, sep=", ")
        print("\nPlease install them before running the rest of this notebook.")

Let's first import the required packages and set some seeds for reproducibility.

In [None]:
import torch
import torchvision
import torchvision.transforms as transforms

import numpy as np
import matplotlib.pyplot as plt
from pylab import rcParams

import warnings

import cleanlab
from cleanlab.rank import get_outlier_scores
from sklearn.metrics import precision_recall_curve
from sklearn.neighbors import NearestNeighbors # import KNN estimator
import timm # resnet50 pre-trained model

SEED = 42
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
torch.cuda.manual_seed_all(SEED)
warnings.filterwarnings("ignore", "Lazy modules are a new feature.*")

## 2. Fetch and scale the Cifar10 dataset

Load the `cifar10` dataset. After some basic preprocessing, we manually remove some categories from the training examples, which makes those categories outliers in the test set. For this example we remove all non-animal categories `[airplane, automobile, ship, truck]` from the training set `trainset`.

In [None]:
# Select how to load the cifar10 datasets: convert each image to a tensor (ToTensor also scales pixel values to [0, 1])
transform_normalize = transforms.Compose(
    [transforms.ToTensor(),
    ])

# Load cifar10 datasets
trainset = torchvision.datasets.CIFAR10(root='./data', train=True,
                                        download=True, transform=transform_normalize)
testset = torchvision.datasets.CIFAR10(root='./data', train=False,
                                       download=True, transform=transform_normalize)

# Manually remove non-animals from the training dataset
animal_labels = [2,3,4,5,6,7]
trainy = trainset.targets # get labels
animal_idxs = np.where(np.isin(trainy, animal_labels))[0] # find idx of animals
trainset = torch.utils.data.Subset(trainset, animal_idxs) # select only animals for the train set

# Check the shapes of our training and test sets
print('Trainset length: %s' % (len(trainset)))
print('Testset length: %s' % (len(testset)))
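
As a sanity check on the filtering step, the same `np.isin` selection can be tried on a small made-up label array. The toy labels below are an assumption for illustration, not CIFAR-10 data. On the real dataset, the 50,000 training images are split evenly across the 10 classes, so keeping the 6 animal classes should leave 30,000 training images, while the test set keeps all 10,000.

```python
import numpy as np

# Hypothetical toy labels standing in for `trainset.targets` (not real CIFAR-10 data)
toy_labels = [0, 2, 9, 3, 7, 1, 5, 8, 6, 4]
animal_labels = [2, 3, 4, 5, 6, 7]

# np.isin marks each label that appears in animal_labels;
# np.where converts that boolean mask into integer positions
toy_animal_idxs = np.where(np.isin(toy_labels, animal_labels))[0]
print(toy_animal_idxs)
```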

#### Let's visualize some of the training and test examples

In [None]:
txt_labels = {0: 'airplane', 
              1: 'automobile', 
              2: 'bird',
              3: 'cat', 
              4: 'deer', 
              5: 'dog', 
              6: 'frog', 
              7: 'horse', 
              8:'ship', 
              9:'truck'}

def imshow(img):
    # Convert a (channels, height, width) tensor into a (height, width, channels) array for matplotlib
    npimg = img.numpy()
    return np.transpose(npimg, (1, 2, 0))

def plot_images(dataset):
    plt.rcParams["figure.figsize"] = (9,7)
    for i in range(15):
        X,y = dataset[i]
        ax = plt.subplot(3,5,i+1)
        ax.set_title(txt_labels[int(y)])
        ax.imshow(imshow(X))
    plt.show()

Observe how there are only animals left in the training set `trainset` below.

In [None]:
plot_images(trainset)

The test set, on the other hand, still visibly contains the non-animal images: `[ship, airplane, automobile, truck]`. If we consider `trainset` to be representative of the normal data distribution, then these non-animal images in `testset` become outliers.

In [None]:
plot_images(testset)

## 3. Import a model and get embeddings
The model we are importing comes from [timm](https://timm.fast.ai/), a deep-learning library that collects SOTA models and utilities.

We pass the images through the model to generate the feature embeddings that the outlier detection algorithm requires as input. The model in this tutorial is a `resnet50`, but outlier detection can be done with any method capable of generating feature embeddings.

In [None]:
# Download the model from timm
# num_classes=0 removes the classification head so the model returns pooled feature embeddings
model = timm.create_model('resnet50', pretrained=True, num_classes=0)
model.eval()

# Create dataloaders for more efficient data streaming to the model
batch_size = 50
trainloader = torch.utils.data.DataLoader(trainset, batch_size=batch_size,
                                          shuffle=False, num_workers=2)

testloader = torch.utils.data.DataLoader(testset, batch_size=batch_size,
                                         shuffle=False, num_workers=2)

In [None]:
# This cell can take ~1-2 mins
train_feature_embeddings = []

for data in trainloader:
    images, labels = data
    with torch.no_grad():
        feature_embeddings = model(images) # Generate feature embeddings of the training data using the model
        train_feature_embeddings.extend(feature_embeddings.numpy())
train_feature_embeddings = np.array(train_feature_embeddings)

print(f'Train embeddings pooled shape: {train_feature_embeddings.shape}')

In [None]:
# This cell can take ~1-2 mins
test_feature_embeddings = []

for data in testloader:
    images, labels = data
    with torch.no_grad():
        feature_embeddings = model(images) # Generate feature embeddings of the test data using the model
        test_feature_embeddings.extend(feature_embeddings.numpy())
test_feature_embeddings = np.array(test_feature_embeddings)
print(f'Test embeddings pooled shape: {test_feature_embeddings.shape}')

## 4. Use cleanlab to find outliers in the dataset
With just the feature embeddings, we can use the `cleanlab` library to try to find the artificially added outlier examples `[airplane, automobile, ship, truck]` in the test dataset. We can also check the training examples to find any naturally occurring outliers.

In [None]:
# Fit a KNN estimator on the train feature embeddings
knn = NearestNeighbors(n_neighbors=20).fit(train_feature_embeddings)

# Get outlier scores for the test feature embeddings
outlier_scores = get_outlier_scores(features=test_feature_embeddings, knn=knn)

# Visualize the 15 most likely outliers (lowest outlier scores)
top_outlier_idxs = (outlier_scores).argsort()[:15]
top_outlier_subset = torch.utils.data.Subset(testset, top_outlier_idxs)
plot_images(top_outlier_subset)

Notice how many of the outliers in `testset` belong to the held-out classes `[airplane, automobile, ship]`. Their feature representations are further away in the model's representation space than those of the animal images also found in `trainset`. The other outlier examples are the stranger, more out-of-distribution pictures of animals.
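
To build intuition for why distant embeddings receive low scores, here is a minimal sketch of a K-nearest-neighbor distance score written in plain NumPy. It assumes the score is derived from the average distance to the k nearest training points, mapped into (0, 1] with `exp(-distance)` so that lower values mean more atypical. The helper `knn_distance_score` and the toy 2-D points are hypothetical, and cleanlab's actual implementation may differ in its details.

```python
import numpy as np

def knn_distance_score(train_features, query_features, k=3):
    """Hypothetical helper: map average k-NN distance to a score in (0, 1]."""
    # Pairwise Euclidean distances, shape (n_query, n_train)
    diffs = query_features[:, None, :] - train_features[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    # Average distance to the k closest training points
    avg_knn_dist = np.sort(dists, axis=1)[:, :k].mean(axis=1)
    # Larger distance -> score closer to 0 (more atypical)
    return np.exp(-avg_knn_dist)

rng = np.random.default_rng(0)
toy_train = rng.normal(loc=0.0, scale=0.5, size=(100, 2))  # tight cluster around the origin
toy_query = np.array([[0.0, 0.0], [5.0, 5.0]])             # one typical point, one far away

toy_scores = knn_distance_score(toy_train, toy_query)
print(toy_scores)
```

The far-away query point ends up with the smaller score, mirroring how the non-animal test images are ranked as outliers relative to the animal-only training embeddings.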

Just for fun, let's visualize what the `NearestNeighbors` algorithm considers the 15 least probable outliers in our test set. Notice that there are far fewer images from the out-of-distribution classes here, and the images are all visually similar to each other.

In [None]:
# Visualize the 15 least likely outliers (highest outlier scores)
bottom_outlier_idxs = (-outlier_scores).argsort()[:15]
bottom_outlier_subset = torch.utils.data.Subset(testset, bottom_outlier_idxs)
plot_images(bottom_outlier_subset)

We can also compute the precision/recall curve to measure how well the outlier scores separate the non-animal test images from the animal ones.

In [None]:
animal_labels = [2,3,4,5,6,7] # identify animal labels in the testing dataset
animal_idxs = np.where(np.isin(testset.targets, animal_labels))[0] # find idx of animals
not_outlier = np.zeros(len(testset.targets), dtype=bool)
not_outlier[animal_idxs] = True
# Lower outlier_scores mean more atypical, so 1 - outlier_scores ranks outliers highest;
# the non-animal images (~not_outlier) are therefore the positive class
precision, recall, thresholds = precision_recall_curve(~not_outlier, 1 - outlier_scores)

In [None]:
plt.plot(recall, precision)
plt.xlabel("Recall", fontsize=14)
plt.ylabel("Precision", fontsize=14)
plt.show()
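
To make the curve concrete, here is a small sketch of how a single point on it is computed: flag every example whose score falls below a cutoff, then compare the flags against the known outliers. The toy scores, labels, and the cutoff of 0.5 are made-up assumptions, and the sketch follows this tutorial's convention that lower scores mean more atypical.

```python
import numpy as np

# Hypothetical toy data: lower score = more atypical
toy_scores = np.array([0.10, 0.20, 0.35, 0.60, 0.80, 0.90])
toy_is_outlier = np.array([True, True, False, True, False, False])

cutoff = 0.5                   # assumed cutoff, chosen only for illustration
flagged = toy_scores < cutoff  # examples we would call outliers

true_positives = np.sum(flagged & toy_is_outlier)
precision_at_cutoff = true_positives / flagged.sum()      # share of flagged examples that are real outliers
recall_at_cutoff = true_positives / toy_is_outlier.sum()  # share of real outliers that were flagged
print(precision_at_cutoff, recall_at_cutoff)
```

Sweeping the cutoff across all observed score values traces out the full curve.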

### Finding naturally occurring outlier examples

We can also use `get_outlier_scores()` to find outlier examples in our training dataset. These examples should be animal images that are strange or different.

In [None]:
# get outlier scores for our train feature embeddings
outlier_scores = get_outlier_scores(features=train_feature_embeddings)

# visualize top 15 outlier scores
top_train_outlier_idxs = (outlier_scores).argsort()[:15]
top_train_outlier_subset = torch.utils.data.Subset(trainset, top_train_outlier_idxs)
plot_images(top_train_outlier_subset)

Just for fun, let's see what our model considers the least likely outliers in the training set! These examples should be very homogeneous.

In [None]:
# visualize bottom 15 outlier scores on train set
bottom_train_outlier_idxs = (-outlier_scores).argsort()[:15]
bottom_train_outlier_subset = torch.utils.data.Subset(trainset, bottom_train_outlier_idxs)
plot_images(bottom_train_outlier_subset)
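
One possible way to turn scores into a yes/no decision, which the cells above do not do, is to take a low percentile of the training scores as a cutoff and flag new examples that fall below it. The sketch below uses made-up scores, and the 5th percentile is an arbitrary assumption that you would need to tune for your own data.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical scores in [0, 1]; lower = more atypical
toy_train_scores = rng.uniform(0.6, 1.0, size=1000)
toy_new_scores = np.array([0.95, 0.70, 0.30, 0.05])

# Cutoff below which roughly 5% of the training scores fall (assumed percentile)
cutoff = np.percentile(toy_train_scores, 5)
flagged_as_outlier = toy_new_scores < cutoff
print(cutoff, flagged_as_outlier)
```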

In [None]:
# Note: This cell is only for docs.cleanlab.ai, if running on local Jupyter or Colab, please ignore it.

top_outlier_idxs_test = [9834, 3820, 2381, 4810, 3291, 6352, 7488, 7368, 4587, 9965, 6927, 2719, 9831, 2130, 9175]
top_outlier_idxs_train = [26999, 20198,  7967, 29061, 16684,  2558, 23072,   454,  5815, 9967, 27499, 22507,  8488, 20685,  9880]

if not all(x in top_outlier_idxs for x in top_outlier_idxs_test):
    raise Exception("Some highlighted examples are missing from top outliers in test set.")

if not all(x in top_train_outlier_idxs for x in top_outlier_idxs_train):
    raise Exception("Some highlighted examples are missing from top outliers in train set.")