# Detecting Outliers with Cleanlab and Sklearn

This 5-minute quickstart tutorial shows how to detect potential outliers in image classification data. The dataset used is `cifar10`, which contains 60,000 images. Each image belongs to 1 of 10 categories: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck.

**Overview of what we'll do in this tutorial:**

- Load the [cifar10](https://www.cs.toronto.edu/~kriz/cifar.html) dataset and do some basic data pre-processing.
- Create `trainX` and `testX` such that `testX` contains extra categories.
- Load a pretrained model and extract feature embeddings of `trainX` and `testX`.
- Compute outlier scores for each example using cleanlab's `get_outlier_scores` method and analyze results.

## 1. Install the required dependencies
You can use `pip` to install all packages required for this tutorial as follows:

```ipython3
!pip install torch
!pip install cleanlab
...
# Make sure to install the version corresponding to this tutorial
# E.g. if viewing master branch documentation:
#     !pip install git+https://github.com/cleanlab/cleanlab.git
```

In [None]:
# Package installation (hidden on docs website).
# If running on Colab, may want to use GPU (select: Runtime > Change runtime type > Hardware accelerator > GPU)

dependencies = ["cleanlab", "matplotlib", "torch", "pylab", "keras", "sklearn", "timm", ]

if "google.colab" in str(get_ipython()):  # Check if it's running in Google Colab
    %pip install cleanlab  # for colab
    cmd = ' '.join([dep for dep in dependencies if dep != "cleanlab"])
    %pip install $cmd
else:
    missing_dependencies = []
    for dependency in dependencies:
        try:
            __import__(dependency)
        except ImportError:
            missing_dependencies.append(dependency)

    if len(missing_dependencies) > 0:
        print("Missing required dependencies:")
        print(*missing_dependencies, sep=", ")
        print("\nPlease install them before running the rest of this notebook.")
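
The check above imports each package to see whether it is installed. A lighter sketch of the same idea uses `importlib.util.find_spec`, which looks a module up without executing it. The helper name `find_missing` and the module names in the usage lines are illustrative only. Keep in mind that these are import names, which can differ from the names used with `pip` (for example, `sklearn` is installed as `scikit-learn`).

```python
import importlib.util

def find_missing(module_names):
    """Return the top-level module names that cannot be located."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]

# "json" ships with Python; the second name is a deliberately made-up module.
missing = find_missing(["json", "definitely_not_a_real_module_xyz"])
print(missing)
```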

Let's first set some seeds for reproducibility.

In [None]:
import numpy as np
import torch
import warnings

SEED = 123
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
torch.cuda.manual_seed_all(SEED)
warnings.filterwarnings("ignore", "Lazy modules are a new feature.*")
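
To see what seeding buys us, here is a minimal sketch using only NumPy: resetting the seed restarts the random stream, so the same draws come out again. The variable names are illustrative and not part of the tutorial's pipeline.

```python
import numpy as np

np.random.seed(123)
first_draw = np.random.rand(3)

np.random.seed(123)  # resetting the seed restarts the random stream
second_draw = np.random.rand(3)

print(np.array_equal(first_draw, second_draw))  # True
```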

## 2. Fetch and scale the Cifar10 dataset

We import the `cifar10` dataset. After some basic preprocessing, we manually remove some categories from the training data, thereby making them outliers in the test set. For this example, we remove all non-animal categories `[airplane, automobile, ship, truck]` from the training set `trainX`.

In [None]:
import matplotlib.pyplot as plt
from pylab import rcParams

from keras.datasets import cifar10, cifar100
from sklearn.model_selection import train_test_split

In [None]:
# Load cifar10 datasets
(trainX, trainy), (testX, testy) = cifar10.load_data()
trainX, _, trainy, _ = train_test_split(trainX, trainy, test_size=0.5, stratify=trainy, random_state=SEED) # only use 50% of each label

# Convert from integers to floats and normalize range 0-1
trainX = trainX.astype('float32') / 255.0
testX = testX.astype('float32') / 255.0

# Manually remove non-animals out of the training dataset
animal_labels = [2,3,4,5,6,7]
animal_idxs = np.where(np.isin(trainy, animal_labels))[0] # find idx of animals
trainX = trainX[animal_idxs]
trainy = trainy[animal_idxs]

# Check the shapes of our training and test sets
print('Train: X=%s, y=%s' % (trainX.shape, trainy.shape))
print('Test: X=%s, y=%s' % (testX.shape, testy.shape))
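
The filtering step above can be tried on a toy array first. This sketch assumes labels shaped `(N, 1)`, as returned by the Keras CIFAR-10 loader; the `toy_*` names are made up for illustration.

```python
import numpy as np

toy_X = np.arange(6).reshape(6, 1)                # six stand-in "images"
toy_y = np.array([[0], [2], [9], [3], [1], [7]])  # labels shaped (N, 1)
keep_labels = [2, 3, 4, 5, 6, 7]                  # the animal classes

# np.isin gives a boolean mask of shape (N, 1); np.where(...)[0] gives the matching row indices
keep_idxs = np.where(np.isin(toy_y, keep_labels))[0]
toy_X_kept, toy_y_kept = toy_X[keep_idxs], toy_y[keep_idxs]

print(keep_idxs)  # [1 3 5]
```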

#### Let's visualize some of the training and test examples

In [None]:
txt_labels = {0: 'airplane', 
              1: 'automobile', 
              2: 'bird',
              3: 'cat', 
              4: 'deer', 
              5: 'dog', 
              6: 'frog', 
              7: 'horse', 
              8:'ship', 
              9:'truck'}

def plot_images(X,y):
    plt.rcParams["figure.figsize"] = (9,7)
    for i in range(15):
        # define subplot
        ax = plt.subplot(3,5,i+1)
        ax.set_title(txt_labels[int(y[i])])
        # plot raw pixel data
        ax.imshow(X[i])
    # show the figure
    plt.show()

Observe how there are only animals left in the training set `trainX` below.

In [None]:
plot_images(trainX,trainy)

The test set, on the other hand, still visibly contains the non-animal images: `[ship, airplane, automobile, truck]`. If we treat `trainX` as representative of the data distribution, then these non-animal images in the test dataset `testX` become outliers.

In [None]:
plot_images(testX,testy)

## 3. Import a model and embeddings
The model we are importing comes from [timm](https://timm.fast.ai/), a deep learning library that collects state-of-the-art models and utilities. We pass the images through the model to generate embeddings in the feature space used for outlier detection.

The model itself is a `resnet50`, but outlier detection can be done with any model capable of generating feature embeddings.

In [None]:
%%time
import timm

# This cell can take ~1-2 mins
model = timm.create_model('resnet50', pretrained=True, num_classes=0) # download the pretrained model from timm; num_classes=0 drops the classifier head
trainX_torch = torch.from_numpy(trainX.swapaxes(1,3)) # turn trainX into a tensor and move the channel axis to position 1
train_feature_embeddings = model(trainX_torch) # forward pass (no training) to get train feature embeddings
train_feature_embeddings = train_feature_embeddings.detach().numpy() # convert to a numpy array
print(f'Pooled shape: {train_feature_embeddings.shape}')
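
The cell above sends every image through the model in a single forward pass. If memory is tight, one option is to embed the images in batches with gradient tracking disabled. The sketch below is a hedged illustration, not the tutorial's method: `embed_in_batches` is a hypothetical helper, and a tiny stand-in network replaces `resnet50` so the example runs without downloading weights. It also switches the model to eval mode, which the cell above does not do, so embeddings produced this way may differ from the tutorial's.

```python
import torch
from torch import nn

def embed_in_batches(model, images, batch_size=64):
    """Run a forward pass batch by batch and return the stacked embeddings as a numpy array."""
    model.eval()  # assumption: inference-mode behavior for layers such as batch norm
    chunks = []
    with torch.no_grad():  # no gradients are needed for feature extraction
        for start in range(0, len(images), batch_size):
            chunks.append(model(images[start:start + batch_size]))
    return torch.cat(chunks).numpy()

# Hypothetical stand-in for a feature extractor: conv layer, global pooling, flatten.
toy_model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
)
toy_images = torch.rand(10, 3, 32, 32)  # (N, C, H, W)
toy_embeddings = embed_in_batches(toy_model, toy_images, batch_size=4)
print(toy_embeddings.shape)  # (10, 8)
```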

In [None]:
%%time

# This cell can take ~1-2 mins
model = timm.create_model('resnet50', pretrained=True, num_classes=0) # load the same pretrained model from timm; num_classes=0 drops the classifier head
testX_torch = torch.from_numpy(testX.swapaxes(1,3)) # turn testX into a tensor and move the channel axis to position 1
test_feature_embeddings = model(testX_torch) # forward pass (no training) to get test feature embeddings
test_feature_embeddings = test_feature_embeddings.detach().numpy() # convert to a numpy array
print(f'Pooled shape: {test_feature_embeddings.shape}')
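
A note on the `swapaxes(1,3)` call used in both cells: it moves the channel axis to position 1, but it also exchanges height and width. For square images the resulting shape is the same as with an explicit `transpose(0, 3, 1, 2)`, yet each image is flipped along its diagonal. The sketch below only illustrates the array layout on a made-up `toy_batch`; whether the difference matters for the embeddings depends on the model.

```python
import numpy as np

toy_batch = np.arange(2 * 4 * 4 * 3).reshape(2, 4, 4, 3)  # (N, H, W, C)

swapped = toy_batch.swapaxes(1, 3)            # (N, C, W, H): height and width are exchanged
transposed = toy_batch.transpose(0, 3, 1, 2)  # (N, C, H, W): spatial layout preserved

print(swapped.shape, transposed.shape)  # same shape for square images
same_layout = np.array_equal(swapped, transposed)
# Swapping the last two axes of `swapped` recovers the (N, C, H, W) layout.
matches_after_flip = np.array_equal(swapped.swapaxes(2, 3), transposed)
print(same_layout, matches_after_flip)
```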

## 4. Use cleanlab to find outliers in the dataset
With just the feature embeddings, we can use the `cleanlab` library to try to find the artificially added outlier examples `[airplane, automobile, ship, truck]` in the test dataset. We can also check the training data to find any naturally occurring outlier examples in the dataset.

In [None]:
import cleanlab
from cleanlab.rank import get_outlier_scores
from sklearn.neighbors import NearestNeighbors # import KNN estimator

In [None]:
# fit a KNN estimator on the resnet50 feature embeddings of the training set
knn = NearestNeighbors(n_neighbors=10).fit(train_feature_embeddings)

# get outlier scores for the test feature embeddings
outlier_scores = get_outlier_scores(features=test_feature_embeddings, knn=knn, k=10)

# visualize the 15 most likely outliers (lowest outlier scores)
top_outlier_idxs = (outlier_scores).argsort()[:15]
plot_images(testX[top_outlier_idxs],testy[top_outlier_idxs])
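
To build intuition for how a nearest-neighbor estimator can produce such scores, here is a minimal sketch on 2-D toy data. It averages each test point's distance to its nearest training points and maps that distance into (0, 1] with `exp(-distance)`, so that lower scores mean more atypical points. This is an assumed, simplified scoring rule for illustration and not necessarily what `get_outlier_scores` computes internally; all `toy_*` names are made up.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
toy_train = rng.normal(0, 1, size=(200, 2))  # in-distribution training points
toy_test = np.vstack([
    rng.normal(0, 1, size=(5, 2)),           # five in-distribution test points
    [[10.0, 10.0]],                          # one point far from the training data
])

toy_knn = NearestNeighbors(n_neighbors=5).fit(toy_train)
distances, _ = toy_knn.kneighbors(toy_test)  # distances to the 5 nearest training points
avg_distance = distances.mean(axis=1)
toy_scores = np.exp(-avg_distance)           # assumed transform: far away -> score near 0

print(toy_scores.argmin())  # 5, the index of the far-away point
```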

Notice how the majority of the outliers belong to the held-out classes (airplane, automobile, ship). Their feature embeddings lie further from the training embeddings in the model's representation space than those of in-distribution test images.

Just for fun, let's visualize what the nearest-neighbors approach considers the 15 least probable outliers in our test set. Notice there are far fewer images from the out-of-distribution classes here.

In [None]:
# visualize the 15 least likely outliers (highest outlier scores)
bottom_outlier_idxs = (-outlier_scores).argsort()[:15]
plot_images(testX[bottom_outlier_idxs],testy[bottom_outlier_idxs])
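
The two ranking cells above differ only in the sign passed to `argsort`. A small sketch with made-up scores shows the effect, following this tutorial's convention that lower scores indicate more likely outliers:

```python
import numpy as np

toy_scores = np.array([0.90, 0.10, 0.50, 0.05])

most_likely_outliers = toy_scores.argsort()[:2]      # ascending: lowest scores first
least_likely_outliers = (-toy_scores).argsort()[:2]  # negated: highest scores first

print(most_likely_outliers)   # [3 1]
print(least_likely_outliers)  # [0 2]
```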

We can also compute a precision-recall curve that measures how well the outlier scores separate the held-out classes from the animal classes in the test set.

In [None]:
from sklearn.metrics import precision_recall_curve
animal_labels = [2,3,4,5,6,7] # identify animal labels in the testing dataset
animal_idxs = np.where(np.isin(testy, animal_labels))[0] # find idx of animals
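not_outlier = np.zeros(len(testy), dtype=bool) # True for in-distribution (animal) examples
not_outlier[animal_idxs] = True
# lower outlier scores mean more atypical examples, so 1 - outlier_scores is larger for likely outliers;
# the positive class is therefore the held-out (non-animal) examples
precision, recall, thresholds = precision_recall_curve(~not_outlier, 1 - outlier_scores)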

In [None]:
plt.plot(recall, precision)
plt.xlabel("Recall", fontsize=14)
plt.ylabel("Precision", fontsize=14)
plt.show()
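
Each point on the curve corresponds to one threshold on the scores. As a hedged illustration with made-up numbers, the sketch below computes precision and recall by hand at a single threshold, flagging an example as an outlier when its score falls below that threshold:

```python
import numpy as np

toy_scores = np.array([0.05, 0.10, 0.80, 0.90, 0.20, 0.95])         # lower = more atypical
toy_is_outlier = np.array([True, True, False, False, False, False])  # ground truth

threshold = 0.25
flagged = toy_scores < threshold                       # examples we call outliers

true_positives = np.sum(flagged & toy_is_outlier)
toy_precision = true_positives / np.sum(flagged)       # fraction of flagged examples that are real outliers
toy_recall = true_positives / np.sum(toy_is_outlier)   # fraction of real outliers that were flagged

print(toy_precision, toy_recall)
```

Here three examples are flagged and two of them are true outliers, so precision is 2/3 while recall is 1.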

### Finding naturally occurring outlier examples

In [None]:
# get outlier scores for our train feature embeddings
outlier_scores = get_outlier_scores(features=train_feature_embeddings, k=10)

# visualize the 15 most likely outliers in the train set (lowest outlier scores)
top_train_outlier_idxs = (outlier_scores).argsort()[:15]
plot_images(trainX[top_train_outlier_idxs],trainy[top_train_outlier_idxs])
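
One subtlety when scoring the same data a nearest-neighbor estimator was fit on: each point's closest neighbor is itself, at distance zero. The sketch below shows this with scikit-learn on toy data. Calling `kneighbors()` without arguments excludes each point from its own neighbor list. How `get_outlier_scores` handles this internally is not shown in this tutorial, so treat the snippet purely as background; the `toy_*` names are made up.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
toy_features = rng.normal(size=(50, 3))
toy_knn = NearestNeighbors(n_neighbors=3).fit(toy_features)

# Querying with the fitted points: the first neighbor of every point is the point itself.
dist_with_self, _ = toy_knn.kneighbors(toy_features)
# No argument: neighbors of the fitted points, with each point excluded from its own list.
dist_without_self, _ = toy_knn.kneighbors()

print(dist_with_self[:, 0].max(), dist_without_self[:, 0].min())
```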

Just for fun, let's see what our model considers the least likely outliers in the training set! These examples are very homogeneous.

In [None]:
# visualize the 15 least likely outliers in the train set (highest outlier scores)
top_outlier_idxs = (-outlier_scores).argsort()[:15]
plot_images(trainX[top_outlier_idxs],trainy[top_outlier_idxs])

In [None]:
# Note: This cell is only for docs.cleanlab.ai, if running on local Jupyter or Colab, please ignore it.

top_outlier_idxs_test = [ 1280, 13254,  6725, 13708, 11397, 11276,  4115,  6938, 14509, 9438,  9832,  3589, 12862,  8287,  4846]
bottom_outlier_idxs_test =[6165, 5020, 8466, 4322, 1914, 5113, 6697, 7608,  273, 7802, 7795,4325, 6125,  910, 8448]

if not all(x in top_outlier_idxs for x in top_outlier_idxs_test):
    raise Exception("Some highlighted examples are missing from top outliers in test set.")

if not all(x in bottom_outlier_idxs for x in bottom_outlier_idxs_test):
    raise Exception("Some highlighted examples are missing from bottom test set outliers.")