# Demo with CIFAR10 data and image preview of embedded data points
This demo requires the CIFAR10 data to be organized in subfolders, one for each class. We can then use the standard DataAggregator, which returns an image and the corresponding filename.
We use the image data to extract embeddings from a pretrained ResNet and reduce their dimensionality further with UMAP, down to just 2 dimensions.
Finally, we draw a scatter plot that also shows the corresponding image when we hover over a data point with the mouse pointer.

# Prerequisites

In [1]:
import numpy as np
from pathlib import Path
from collections import Counter

import torch
from torchvision.datasets import CIFAR10
from torchvision.transforms import Compose, ToTensor, Grayscale, Normalize, Resize
from torchvision.models import resnet18

from hyperpyper.utils import IndexToClassLabelDecoder, ClassToIndexLabelDecoder, FileToClassLabelDecoder
from hyperpyper.utils import DataSetDumper, VisionDatasetDumper
from hyperpyper.utils import FolderScanner as fs
from hyperpyper.utils import Pickler
from hyperpyper.utils import EmbeddingPlotter
from hyperpyper.utils import PipelineCache
from hyperpyper.utils import PathList
from hyperpyper.transforms import (FileToPIL,
                            DummyPIL,
                            PILToNumpy,
                            FlattenArray,
                            DebugTransform,
                            ProjectTransform,
                            PyTorchOutput,
                            PyTorchEmbedding,
                            ToDevice,
                            FlattenTensor,
                            CachingTransform)
from hyperpyper.aggregator import DataAggregator, DataSetAggregator

import matplotlib.pyplot as plt

import plotly.graph_objs as go
import plotly.express as px
import ipywidgets as widgets

from IPython.display import display


random_state = 23

In [2]:
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
print("Running on device:", DEVICE.upper())

Running on device: CUDA


In [3]:
ROOT_PATH = Path.home() / "Downloads" / "data"

DATA_PATH = ROOT_PATH / "CIFAR10"

DATA_PATH_TEST = Path(DATA_PATH, "test")
DATA_PATH_TRAIN = Path(DATA_PATH, "train")

CACHE_PATH = DATA_PATH / "tmp"

### Define a folder for the UMAP-specific cache; the filename is generated automatically from the data processed by the pipeline.

In [4]:
UMAP_2D_CACHE_PATH = CACHE_PATH / "umap2d"

### Define a file for the caching of the embedding vectors.

In [5]:
CIFAR10_train_embedding_resnet18_file = Path(CACHE_PATH, "CIFAR10_train_embedding_resnet18.pkl")

### Let's see if there are any cache files present already

In [6]:
cache_file = fs.get_files(CACHE_PATH, recursive=True, relative_to=CACHE_PATH)
cache_file

[]

## Create CIFAR10 dataset organized in subfolders indicating class
The VisionDatasetDumper handles the download and the creation of a folder structure where images are stored. They can then be used as the starting point for experiments. We only need the dataset returned by the VisionDatasetDumper to extract the class labels to be able to match them with class indices.

In [7]:
train_dataset = VisionDatasetDumper(CIFAR10, root=DATA_PATH, dst=DATA_PATH_TRAIN, train=True).dump()

Files already downloaded and verified


In [8]:
train_dataset.classes

['airplane',
 'automobile',
 'bird',
 'cat',
 'deer',
 'dog',
 'frog',
 'horse',
 'ship',
 'truck']

### Retrieve a list of files

In [9]:
train_files = fs.get_files(DATA_PATH_TRAIN, extensions='.png', recursive=True)

len(train_files)

50000
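
For readers without hyperpyper at hand, a recursive listing by extension can be approximated with pathlib alone. This is a minimal sketch, not the FolderScanner implementation: `list_files` is a hypothetical helper, and returning the paths in sorted order is an assumption made here for reproducibility.

```python
from pathlib import Path


def list_files(root, extension=".png"):
    """Return all files below root with the given extension, in sorted order."""
    root = Path(root)
    # rglob descends into every subfolder, so one folder per class is covered
    return sorted(p for p in root.rglob(f"*{extension}") if p.is_file())
```

Called as `list_files(DATA_PATH_TRAIN)`, it would return one `Path` per image found below the training folder.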

### Load a pre-trained PyTorch model

In [10]:
weights_pretrained = torch.load("weights_resnet18_cifar10.pth", map_location=DEVICE)

# load model with pre-trained weights
model = resnet18(num_classes=10)
model.load_state_dict(weights_pretrained)

<All keys matched successfully>

## Define Transformation pipeline
Note that the pipeline starts with a FileToPIL transformation, which handles the loading of the image. This lets us use the standard DataAggregator without having to instantiate a Dataset or DataLoader ourselves.
All we need to pass as arguments are a file list and the transformation pipeline, and optionally a batch size.

In [11]:
# Create the transformation pipeline
transform_pipeline = Compose([
    FileToPIL(),
    ToTensor(),
    Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
    ToDevice(DEVICE),
    PyTorchEmbedding(model, device=DEVICE),
    ToDevice('cpu'),
    FlattenTensor(),
])
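
torchvision's Compose applies its transforms one after another, feeding each output into the next. The following stand-in uses plain Python callables to illustrate that ordering. `SimpleCompose` and the toy steps are hypothetical and only mimic the idea; they are not the hyperpyper transforms.

```python
class SimpleCompose:
    """Apply a list of callables in order, similar in spirit to Compose."""

    def __init__(self, transforms):
        self.transforms = list(transforms)

    def __call__(self, x):
        for t in self.transforms:
            x = t(x)
        return x


# Toy steps standing in for load -> scale -> flatten
toy_pipeline = SimpleCompose([
    lambda name: [[1, 2], [3, 4]],                      # pretend to load a 2x2 "image" for a file name
    lambda img: [[v / 4 for v in row] for row in img],  # scale the values into [0, 1]
    lambda img: [v for row in img for v in row],        # flatten to a vector
])

vector = toy_pipeline("example.png")
```

The order matters: in the real pipeline above, the tensor must be on `DEVICE` before it reaches the model and back on the CPU before it is flattened and collected.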

## Instantiate Aggregator and extract embeddings
The result follows the torchvision convention of (X, y) tuples. Note that y holds the files, not a target. We do this to keep the data and the corresponding files together, even if the mini-batch procedure shuffles the order. The files are used later to connect a UMAP data point to the corresponding image.

In [12]:
agg = DataAggregator(files=train_files, transforms=transform_pipeline, batch_size=320)

train_X, train_y_files = agg.transform(cache_file=CIFAR10_train_embedding_resnet18_file)
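
The idea of returning the files alongside the data can be sketched in a few lines. This is an illustration under stated assumptions, not the DataAggregator implementation: `aggregate` is a hypothetical function, and `len` stands in for the real transformation pipeline.

```python
def aggregate(files, transform, batch_size=4):
    """Process files in mini-batches and return (X, y_files) in matching order."""
    X, y_files = [], []
    for start in range(0, len(files), batch_size):
        batch = files[start:start + batch_size]
        # Each result is stored together with the file it came from
        for f in batch:
            X.append(transform(f))
            y_files.append(f)
    return X, y_files


example_files = ["0/a.png", "0/b.png", "1/c.png"]
X, y_files = aggregate(example_files, transform=len, batch_size=2)
```

Because `X[i]` always belongs to `y_files[i]`, any later step can look up the image behind a data point without relying on the original order of the file list.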

### Since transform() has been executed, a cache file has been created (in case it was not already there)

In [13]:
cache_file = fs.get_files(CACHE_PATH, recursive=True, relative_to=CACHE_PATH)
cache_file

[WindowsPath('CIFAR10_train_embedding_resnet18.pkl')]

### Define the pipeline to feed the embedding vectors to a UMAP dimensionality reducer

In [14]:
from umap import UMAP
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Define the pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('umap', UMAP()),
])

### Wrap a PipelineCache around the pipeline to automatically cache the output of the pipeline

In [15]:
pipeline = PipelineCache(pipeline, cache_path=UMAP_2D_CACHE_PATH)

In [16]:
train_reduced_embedding = pipeline.fit_transform(train_X)
train_reduced_embedding.shape

(50000, 2)

### After fit(), transform(), or fit_transform(), a cache file is created that corresponds to the data processed with the pipeline
Each of these methods gets its own subfolder, in which the cache file is stored.

In [17]:
cache_file = fs.get_files(UMAP_2D_CACHE_PATH, recursive=True, relative_to=UMAP_2D_CACHE_PATH)
cache_file

[WindowsPath('fit_transform/98ffc4140f08d25694767acac014fa795b6d341e3944c1faa26f378169d22a10.pkl')]
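
The 64-character hexadecimal filename looks like a content hash. How PipelineCache actually derives it is not shown in this notebook, so the following is only a sketch of the general idea: hash the input, and reuse a stored result when the same input appears again. The function `cached_call`, the use of SHA-256 over the pickled input, and the subfolder named after the function are all assumptions made for illustration.

```python
import hashlib
import pickle
from pathlib import Path


def cached_call(func, data, cache_dir):
    """Run func(data) once and reuse the pickled result for identical input."""
    # One subfolder per function, mirroring the layout seen above
    cache_dir = Path(cache_dir) / func.__name__
    cache_dir.mkdir(parents=True, exist_ok=True)

    # Assumption: the cache key is the SHA-256 digest of the pickled input
    key = hashlib.sha256(pickle.dumps(data)).hexdigest()
    cache_file = cache_dir / f"{key}.pkl"

    if cache_file.exists():
        with cache_file.open("rb") as fh:
            return pickle.load(fh)

    result = func(data)
    with cache_file.open("wb") as fh:
        pickle.dump(result, fh)
    return result
```

With such a scheme, changing the input data yields a different key and therefore a fresh computation, while rerunning on unchanged data only loads the stored file.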

### Translate indices to class names and vice versa
scikit-learn provides encoders and decoders as well, but here we want to infer the class labels from the folder names.

In [18]:
# Extract class indices from filenames
file_encoder = FileToClassLabelDecoder()
train_y = file_encoder(train_y_files)

# Convert indices to class labels
label_decoder = IndexToClassLabelDecoder(train_dataset.classes)
train_y_str = label_decoder(train_y)
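
The two decoding steps can be pictured with plain Python. This sketch assumes that each image sits in a folder named after its integer class index (for example `train/3/img.png`); that layout is an assumption about the dumped data, and `file_to_index` and `index_to_label` are hypothetical helpers rather than the hyperpyper decoders.

```python
from pathlib import Path

# Shortened class list for illustration
example_classes = ["airplane", "automobile", "bird", "cat"]


def file_to_index(path):
    """Assumption: the parent folder name is the integer class index."""
    return int(Path(path).parent.name)


def index_to_label(index, classes):
    """Look up the class name that belongs to a class index."""
    return classes[index]


example_paths = ["train/0/img_a.png", "train/3/img_b.png"]
indices = [file_to_index(p) for p in example_paths]
labels = [index_to_label(i, example_classes) for i in indices]
```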

### Let's have a look at the label distribution

In [19]:
ctr = Counter(train_y_str)
ctr

Counter({'airplane': 5000,
         'automobile': 5000,
         'bird': 5000,
         'cat': 5000,
         'deer': 5000,
         'dog': 5000,
         'frog': 5000,
         'horse': 5000,
         'ship': 5000,
         'truck': 5000})

## Plot the UMAP-reduced embedding vectors with an image preview

In [20]:
plotter = EmbeddingPlotter(data=train_reduced_embedding,
                           color=train_y_str,
                           file_list=train_y_files,
                           width=1000)
display(plotter.plot())

Box(children=(FigureWidget({
    'data': [{'hovertemplate': '<b>%{hovertext}</b><br><br>color=airplane<br>x=%{…

## A second run of the notebook will be much faster, as it benefits from the cached results