# Outlier detection in the CIFAR-10 dataset using ``Annoy`` and ``AutoGluon``

This tutorial shows how to extend outlier detection with cleanlab's ``get_outlier_scores()`` function to KNN implementations outside of the ``sklearn`` library, such as approximate nearest-neighbor libraries. It focuses on ``Annoy``, used through a subclass of ``sklearn.neighbors.NearestNeighbors``, but any subclass works as long as it can return an array of nearest-neighbor distances. ``get_outlier_scores()`` takes the following parameters:

- Feature array of shape ``(N, M)``, where N is the number of examples and M is the number of features used to represent each example.
- The ``sklearn.neighbors.NearestNeighbors`` object or subclass of ``sklearn.neighbors.NearestNeighbors`` that's been fitted on a dataset in the same feature space.
- ``k``, the number of neighbors, and ``t``, the rescaling factor for the outlier scores.
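
To make the roles of these three inputs concrete, here is a minimal NumPy sketch. It is not cleanlab's implementation: ``BruteForceKNN`` is a hypothetical stand-in for a fitted ``NearestNeighbors``-style estimator, ``knn_outlier_scores`` is an illustrative helper, and the mapping ``exp(-t * average distance)`` (lower score means more atypical) is an assumption chosen for the example.

```python
import numpy as np


class BruteForceKNN:
    """Hypothetical stand-in for a fitted NearestNeighbors-style estimator."""

    def fit(self, X):
        self._X = np.asarray(X, dtype=float)
        return self

    def kneighbors(self, X, n_neighbors):
        X = np.asarray(X, dtype=float)
        # Pairwise Euclidean distances, shape (len(X), len(self._X)).
        diffs = X[:, None, :] - self._X[None, :, :]
        dists = np.sqrt((diffs ** 2).sum(axis=-1))
        # Indices of the n_neighbors closest reference points per query.
        idx = np.argsort(dists, axis=1)[:, :n_neighbors]
        return np.take_along_axis(dists, idx, axis=1), idx


def knn_outlier_scores(features, knn, k=10, t=1.0):
    """Map the average k-NN distance to a score in (0, 1] (assumed transform)."""
    distances, _ = knn.kneighbors(features, n_neighbors=k)
    avg_distance = distances.mean(axis=1)
    return np.exp(-t * avg_distance)  # lower score = more atypical


rng = np.random.default_rng(0)
reference = rng.normal(size=(200, 8))           # "training" features, shape (N, M)
queries = np.vstack([rng.normal(size=(5, 8)),   # five in-distribution examples
                     np.full((1, 8), 10.0)])    # one obvious outlier (last row)

knn = BruteForceKNN().fit(reference)
scores = knn_outlier_scores(queries, knn, k=10, t=1.0)
print(scores.argmin())  # index of the most atypical query
```

Because the helper only calls ``kneighbors``, any estimator that returns neighbor distances in that shape could be swapped in, which is the property this tutorial relies on when replacing the ``sklearn`` search with ``Annoy``.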

## 1. Load packages and set seeds for reproducibility

In [1]:
import torch
import torchvision
import torchvision.transforms as transforms

import numpy as np
import matplotlib.pyplot as plt
from pylab import rcParams

import warnings

import cleanlab
from cleanlab.rank import get_outlier_scores
from sklearn.metrics import precision_recall_curve
# from sklearn.neighbors import NearestNeighbors # import KNN estimator
import timm # resnet50 pre-trained model
from annoy import AnnoyIndex

SEED = 42
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
torch.cuda.manual_seed_all(SEED)
warnings.filterwarnings("ignore", "Lazy modules are a new feature.*")



ImportError: cannot import name 'get_outlier_scores' from 'cleanlab.rank' (/home/ulyana/virtual/multi/lib/python3.8/site-packages/cleanlab/rank.py)

## 2. Fetch and scale the Cifar10 dataset

In [2]:
# Convert images to tensors; ToTensor() also scales pixel values to the range [0, 1]
transform_normalize = transforms.Compose(
    [transforms.ToTensor()])

# Load cifar10 datasets
trainset = torchvision.datasets.CIFAR10(root='./data', train=True,
                                        download=True, transform=transform_normalize)
testset = torchvision.datasets.CIFAR10(root='./data', train=False,
                                       download=True, transform=transform_normalize)

# Keep only the animal classes (labels 2-7) in the training dataset
animal_labels = [2,3,4,5,6,7]
trainy = trainset.targets # get labels
animal_idxs = np.where(np.isin(trainy, animal_labels))[0] # find idx of animals
trainset  = torch.utils.data.Subset(trainset, animal_idxs) # select only animals for the train set

# Check the shapes of our training and test sets
print('Trainset length: %s' % (len(trainset)))
print('Testset length: %s' % (len(testset)))

Trainset length: 30000
Testset length: 10000
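
The label filter above can be checked on a small example. The sketch below uses a made-up label vector (``toy_targets``, a stand-in for ``trainset.targets``) together with the standard CIFAR-10 class order to show which examples ``np.isin`` keeps.

```python
import numpy as np

# CIFAR-10 class names in label order (list index == integer label).
classes = ["airplane", "automobile", "bird", "cat", "deer",
           "dog", "frog", "horse", "ship", "truck"]
animal_labels = [2, 3, 4, 5, 6, 7]

# Toy label vector standing in for trainset.targets (illustrative only).
toy_targets = [0, 2, 9, 3, 3, 8, 7, 1, 6]

# Same selection logic as the cell above: positions whose label is an animal.
animal_idxs = np.where(np.isin(toy_targets, animal_labels))[0]
kept = [classes[toy_targets[i]] for i in animal_idxs]

print(animal_idxs.tolist())
print(kept)
```

Six of the ten classes are animals and the CIFAR-10 training split holds 5,000 images per class, which is consistent with the training set length of 30000 printed above.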


In [4]:
# Create dataloaders for more efficient data streaming to the model
batch_size = 50

trainloader = torch.utils.data.DataLoader(trainset, batch_size=batch_size,
                                          shuffle=False, num_workers=2)

testloader = torch.utils.data.DataLoader(testset, batch_size=batch_size,
                                         shuffle=False, num_workers=2)

## 3. Create and train a model
The model is a Swin Transformer trained with the ``autogluon`` image predictor library.

We pass the images through the model to generate the feature embeddings that the outlier detection algorithm requires as input. The model in this tutorial comes from ``autogluon.vision``, but outlier detection works with any method capable of generating feature embeddings.

In [5]:
from autogluon.vision import ImagePredictor, ImageDataset

# Train and save the model (only works on the remote server)
model_name = 'swin_base_patch4_window7_224' # or resnet50

# Initialize the predictor
model = ImagePredictor(verbosity=2)

# Train the model [TODO: mini_train vs full train; fix epochs and time_limit]
# NOTE: fit() rejects a torch DataLoader (see the TypeError in the output below);
# train_data must be a dataset type AutoGluon can process, such as an ImageDataset.
model.fit(
    train_data=trainloader,
    ngpus_per_trial=1,
    hyperparameters={"holdout_frac": 0.2, "model": model_name},
    time_limit=18000,
    random_state=12345,
)

ImagePredictor sets accuracy as default eval_metric for classification problems.


TypeError: Unable to process dataset of type: <class 'torch.utils.data.dataloader.DataLoader'>

## 4. Get model embeddings

In [None]:
# This cell can take ~1-2 mins
# Assumes `model` is callable on an image batch and returns a 2-D tensor of embeddings
train_feature_embeddings = []

with torch.no_grad():  # inference only, so skip gradient tracking
    for images, labels in trainloader:
        feature_embeddings = model(images)  # feature embeddings for this batch
        # Move to CPU before converting, in case the model runs on a GPU
        train_feature_embeddings.extend(feature_embeddings.detach().cpu().numpy())
train_feature_embeddings = np.array(train_feature_embeddings)
print(f'Train embeddings pooled shape: {train_feature_embeddings.shape}')
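
The batch-wise pooling pattern in this cell can be tried without a trained model. In the sketch below, ``embed`` is a hypothetical stand-in for the model's forward pass (a fixed random projection, not a real network), and the image array is random data with CIFAR-10 dimensions. It only illustrates how per-batch outputs are stacked into a single ``(N, M)`` feature array.

```python
import numpy as np

rng = np.random.default_rng(0)
images = rng.random((120, 3, 32, 32))             # fake CIFAR-sized images
projection = rng.normal(size=(3 * 32 * 32, 16))   # fixed random projection to 16-D


def embed(batch):
    """Hypothetical stand-in for a model forward pass: flatten, then project."""
    flat = batch.reshape(len(batch), -1)
    return flat @ projection


batch_size = 50
chunks = []
for start in range(0, len(images), batch_size):
    # The final batch is smaller (20 images); stacking handles that naturally.
    chunks.append(embed(images[start:start + batch_size]))

embeddings = np.concatenate(chunks, axis=0)       # shape (N, M)
print(embeddings.shape)
```

Collecting whole batches and concatenating once at the end avoids converting a long Python list of per-example arrays, though both approaches yield the same ``(N, M)`` result.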