<a href="https://colab.research.google.com/github/thesteve0/impatient-computer-vision/blob/main/2_classify_embed.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Classification and Embedding

We are going to do our housekeeping steps, which will take a little while to run. While they are running, we will go back to the slides and I will introduce the topics.

### Housekeeping
Before we do anything else, we need to change our machine type to one that has a GPU. Computer vision tasks on a CPU are extremely slow for all but a few specific models. One of the reasons we are using Colab is that it gives you free access to a GPU for the workshop.

**FROM THIS NOTEBOOK FORWARD YOU WILL NEED TO CONNECT TO THIS RUNTIME TYPE**

Please:
1. Go up to the top right of the browser
2. Select "Connect"
3. Then "Change Runtime Type"
![change_runtime](https://github.com/thesteve0/impatient-computer-vision/blob/main/assets/2_pick_GPU1.png?raw=1)

4. Pick T4 GPU
5. Click Save
![pick GPU](https://github.com/thesteve0/impatient-computer-vision/blob/main/assets/2_pick_GPU2.png?raw=1)

6. When the runtime connects, it should look like this
![running GPU](https://github.com/thesteve0/impatient-computer-vision/blob/main/assets/2_pick_GPU3.png?raw=1)
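
If you want to confirm from code that the runtime really has a GPU attached, a check along these lines may help. This is a minimal sketch, not part of the workshop material: the helper name `gpu_summary` is made up for illustration, and it assumes the `nvidia-smi` command-line tool is on the path whenever an NVIDIA GPU is present, which is normally the case on Colab GPU runtimes.

```python
import shutil
import subprocess


def gpu_summary() -> str:
    """Return the name of the attached NVIDIA GPU(s), or a note if none is visible."""
    # nvidia-smi ships with the NVIDIA driver; if it is missing, assume a CPU-only runtime
    if shutil.which("nvidia-smi") is None:
        return "No NVIDIA driver found: the runtime is probably CPU-only"

    # Ask nvidia-smi for just the GPU name, one line per device
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout.strip() or "nvidia-smi ran but reported no GPU"


print(gpu_summary())
```

On a T4 runtime the printed name should mention the T4. If you see the CPU-only message, repeat the steps above before continuing.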


Now it is time to run our long-running tasks:
1. Map the drive
2. Load the dependencies
3. Load the data

In [1]:
from google.colab import drive
drive.mount('/content/drive')

!pip install fiftyone==1.4.1 torch torchvision umap-learn


import fiftyone as fo

name = "our-photos"
# Named dataset_dir rather than dir so we do not shadow Python's built-in dir()
dataset_dir = "/content/drive/MyDrive/impatient-cv/flickr-labeled"

dataset = fo.Dataset.from_dir(
    dataset_dir=dataset_dir,
    dataset_type=fo.types.FiftyOneDataset,
    name=name
)

print(dataset)

Mounted at /content/drive
Collecting fiftyone==1.4.1
  Downloading fiftyone-1.4.1-py3-none-any.whl.metadata (23 kB)
Collecting aiofiles (from fiftyone==1.4.1)
  Downloading aiofiles-24.1.0-py3-none-any.whl.metadata (10 kB)
Collecting argcomplete (from fiftyone==1.4.1)
  Downloading argcomplete-3.6.2-py3-none-any.whl.metadata (16 kB)
Collecting boto3 (from fiftyone==1.4.1)
  Downloading boto3-1.38.14-py3-none-any.whl.metadata (6.6 kB)
Collecting dacite<1.8.0,>=1.6.0 (from fiftyone==1.4.1)
  Downloading dacite-1.7.0-py3-none-any.whl.metadata (14 kB)
Collecting ftfy (from fiftyone==1.4.1)
  Downloading ftfy-6.3.1-py3-none-any.whl.metadata (7.3 kB)
Collecting hypercorn>=0.13.2 (from fiftyone==1.4.1)
  Downloading hypercorn-0.17.3-py3-none-any.whl.metadata (5.4 kB)
Collecting kaleido!=0.2.1.post1 (from fiftyone==1.4.1)
  Downloading kaleido-0.2.1-py2.py3-none-manylinux1_x86_64.whl.metadata (15 kB)
Collecting mongoengine~=0.29.1 (from fiftyone==1.4.1)
  Downloading mongoengine-0.29.1-py3-non

INFO:fiftyone.utils.data.importers:Importing samples...


 100% |█████████████████| 300/300 [14.3ms elapsed, 0s remaining, 21.0K samples/s]     




Name:        our-photos
Media type:  image
Num samples: 300
Persistent:  False
Tags:        []
Sample fields:
    id:               fiftyone.core.fields.ObjectIdField
    filepath:         fiftyone.core.fields.StringField
    tags:             fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)
    metadata:         fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.ImageMetadata)
    created_at:       fiftyone.core.fields.DateTimeField
    last_modified_at: fiftyone.core.fields.DateTimeField
    open_clip_embed:  fiftyone.core.fields.VectorField
    ground_truth:     fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Classification)


## Classification

As we discussed in the slides, classification is the computer vision task of assigning an image to a single class out of a fixed list of classes. We are going to use a classification model that is the foundation for many other models and is still quite powerful: ResNet. We are going to use the simplest version, ResNet18, because:

1. It doesn't require many GPU resources
2. It is fast to compute

There are many variations to ResNet where a number is appended to the name. This number usually represents the number of layers in the neural network.
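
To see where those numbers come from, here is a small arithmetic sketch. It assumes the standard block layouts from the original ResNet design (four stages of residual blocks, with two convolutions per block in the smaller variants and three in the larger ones) and counts only the layers that conventionally contribute to the name: the first convolution, the convolutions inside the blocks, and the final fully connected layer. The shortcut (downsample) convolutions are not counted. The dictionary and function names are made up for this illustration.

```python
# Block layout per variant: (blocks in each of the 4 stages, conv layers per block).
# These layouts are written out by hand here; nothing is read from a library.
RESNET_LAYOUTS = {
    "resnet18": ([2, 2, 2, 2], 2),
    "resnet34": ([3, 4, 6, 3], 2),
    "resnet50": ([3, 4, 6, 3], 3),
    "resnet101": ([3, 4, 23, 3], 3),
    "resnet152": ([3, 8, 36, 3], 3),
}


def count_weighted_layers(blocks_per_stage, convs_per_block):
    """Count the first conv, every conv inside the residual blocks, and the final linear layer."""
    return 1 + sum(blocks_per_stage) * convs_per_block + 1


for name, (blocks, convs) in RESNET_LAYOUTS.items():
    print(name, count_weighted_layers(blocks, convs))
```

For ResNet18 this works out to 1 + (2 + 2 + 2 + 2) × 2 + 1 = 18.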

### Training data

While ResNet18 has a specific architecture, the model needs to be trained on data before it can make predictions. There are many foundational datasets in computer vision, but a particularly common one is [ImageNet](https://www.image-net.org/index.php). This dataset has 1k classes and millions of annotated images.

Please open the list of the [ImageNet classes](https://deeplearning.cms.waikato.ac.nz/user-guide/class-maps/IMAGENET/) in another browser tab. We will be referring to it later in the notebook.

FiftyOne has a [dataset zoo](https://docs.voxel51.com/dataset_zoo/datasets.html) where many important computer vision datasets have been converted into FiftyOne format and are easy to download and view.

Let's go ahead and download and view a small subset of the ImageNet data, the [ImageNet Sample Data](https://docs.voxel51.com/dataset_zoo/datasets.html#imagenet-sample).

In [3]:
import fiftyone.zoo as foz

imagenet_samples = foz.load_zoo_dataset("imagenet-sample")

session = fo.launch_app(imagenet_samples, auto=False)

session.url


Downloading dataset to '/root/fiftyone/imagenet-sample'

Downloading dataset...

 100% |████|  762.4Mb/762.4Mb [1.6s elapsed, 0s remaining, 570.8Mb/s]

Extracting dataset...

Parsing dataset metadata

Found 1000 samples

Dataset info written to '/root/fiftyone/imagenet-sample/info.json'

Loading 'imagenet-sample'

 100% |███████████████| 1000/1000 [490.7ms elapsed, 0s remaining, 2.0K samples/s]

Dataset 'imagenet-sample' created

Session launched. Run `session.show()` to open the App in a cell output.



Welcome to

███████╗██╗███████╗████████╗██╗   ██╗ ██████╗ ███╗   ██╗███████╗
██╔════╝██║██╔════╝╚══██╔══╝╚██╗ ██╔╝██╔═══██╗████╗  ██║██╔════╝
█████╗  ██║█████╗     ██║    ╚████╔╝ ██║   ██║██╔██╗ ██║█████╗
██╔══╝  ██║██╔══╝     ██║     ╚██╔╝  ██║   ██║██║╚██╗██║██╔══╝
██║     ██║██║        ██║      ██║   ╚██████╔╝██║ ╚████║███████╗
╚═╝     ╚═╝╚═╝        ╚═╝      ╚═╝    ╚═════╝ ╚═╝  ╚═══╝╚══════╝ v1.4.1

If you're finding FiftyOne helpful, here's how you can get involved:

|
|  ⭐⭐⭐ Give the project a star on GitHub ⭐⭐⭐
|  https://github.com/voxel51/fiftyone
|
|  🚀🚀🚀 Join the FiftyOne Discord community 🚀🚀🚀
|  https://community.voxel51.com/
|






'https://5151-gpu-t4-s-laa75geddc9l-b.us-west4-1.prod.colab.dev?polling=true'

### FiftyOne Model Zoo

The computer vision platform we have been using, FiftyOne, also has a set of models already converted into a format that works with the rest of the FiftyOne platform. Typically, you would have to use library-specific code, such as PyTorch, along with other code to specify the architecture to run a computer vision model. With FiftyOne, we can load the model in one line of code, and then run it for classification (inference) with another line of code. Two lines of code and you are in business.

#### ResNet18 in the model zoo

We are going to load the PyTorch version of the [ResNet18 model](https://docs.voxel51.com/model_zoo/models.html#resnet18-imagenet-torch) that was trained on ImageNet.

In [12]:
resnet18_imagenet_model = foz.load_zoo_model("resnet18-imagenet-torch")


In [19]:
print("Has logits:", resnet18_imagenet_model.has_logits)
# `return_logits` does not exist on this model class (it raises AttributeError);
# `store_logits` is the flag that controls whether logits are kept with each prediction
print("Store logits:", resnet18_imagenet_model.store_logits)


Has logits: True

### Predictions of our Photos

We have loaded our Flickr dataset and our classification model. Now it is time to have the model predict the classifications for our images.

In [20]:
dataset.apply_model(resnet18_imagenet_model, label_field="rn18_in_predictions", num_workers=12, progress_bar=True, store_logits=True)

# Now let's look at the results
session.dataset = dataset



 100% |█████████████████| 300/300 [5.3s elapsed, 0s remaining, 65.6 samples/s]      




In [21]:
session = fo.launch_app(dataset,auto=False)
session.url

Session launched. Run `session.show()` to open the App in a cell output.




'https://5151-gpu-t4-s-laa75geddc9l-b.us-west4-1.prod.colab.dev?polling=true'

#### Deep dive on the horse

I want us to dig in on one particular sample.


In [22]:
horse_valley = dataset["6773012fa08cade6ec7e44f2"]

session.sample_id = horse_valley["id"]

In [23]:
print(horse_valley)


<Sample: {
    'id': '6773012fa08cade6ec7e44f2',
    'media_type': 'image',
    'filepath': '/content/drive/MyDrive/impatient-cv/flickr-labeled/data/52383615406_11a126d6a1_z.jpg',
    'tags': [],
    'metadata': <ImageMetadata: {
        'size_bytes': 79364,
        'mime_type': 'image/jpeg',
        'width': 640,
        'height': 427,
        'num_channels': 3,
    }>,
    'created_at': datetime.datetime(2025, 5, 13, 18, 57, 53, 192000),
    'last_modified_at': datetime.datetime(2025, 5, 13, 19, 33, 55, 978000),
    'open_clip_embed': array([ 5.77109009e-02,  3.25371355e-01,  3.74628663e-01,  1.14519350e-01,
           -2.18840286e-01, -3.10990393e-01,  7.51966178e-01,  1.15380794e-01,
           -5.84102988e-01,  6.53548837e-01,  4.52027500e-01, -3.98815155e-01,
            1.29895747e-01, -4.45800662e-01,  5.91575086e-01,  1.45337433e-01,
            2.37344861e-01, -1.42008401e-02,  2.01493084e-01, -2.20673829e-01,
           -2.92100370e-01,  2.61853456e-01, -2.52594590e-01, -1.1

Now let's see what the generated predictions tell us

In [24]:
import torch.nn.functional as TF
import torch

model_classes = resnet18_imagenet_model.classes
logits = torch.from_numpy(horse_valley["rn18_in_predictions"]["logits"])

print("There are " + str(len(logits))+ " logits")

print("\nHere are the first 25 logits")
print(str(logits[:25]))

# Softmax turns the raw logits into confidences that sum to 1
confidences = TF.softmax(logits, dim=0)
print("\nHere are the first 25 confidence scores")
print(str(confidences[:25]))

# Get top 5 values and their indices
top_values, top_indices = torch.topk(confidences, k=5)

print("Top 5 confidence values:", top_values)
print("Their indices:", top_indices)

print("\nPredictions in descending confidence:\n")
for idx, value in zip(top_indices.tolist(), top_values.tolist()):
    print("Prediction: " + model_classes[idx] + " \tConfidence: " + str(value))

There are 1000 logits

Here are the first 25 logits
tensor([ 0.6507, -1.4213, -0.5859, -0.8789, -0.5630, -0.8333, -0.6420,  0.3065,
         0.2298,  2.0134, -0.0829, -0.4339,  0.5695, -0.1943, -0.9017,  0.7547,
         0.1513,  0.3110,  0.8502,  0.5325,  1.4155,  1.5713,  2.0315,  2.4502,
         1.1756])

Here are the first 25 confidence scores
tensor([6.2269e-04, 7.8414e-05, 1.8080e-04, 1.3488e-04, 1.8500e-04, 1.4118e-04,
        1.7094e-04, 4.4135e-04, 4.0875e-04, 2.4326e-03, 2.9899e-04, 2.1047e-04,
        5.7411e-04, 2.6746e-04, 1.3184e-04, 6.9092e-04, 3.7791e-04, 4.4331e-04,
        7.6014e-04, 5.5326e-04, 1.3379e-03, 1.5634e-03, 2.4770e-03, 3.7649e-03,
        1.0525e-03])
Top 5 confidence values: tensor([0.0430, 0.0416, 0.0298, 0.0259, 0.0149])
Their indices: tensor([349, 979, 970, 350, 672])

Predictions in descending confidence:

Prediction: bighorn 	Confidence: 0.04295031726360321
Prediction: valley 	Confidence: 0.04159349575638771
Prediction: alp 	Confidence: 0.029837749898433685


### Discussing the results

1. What are some of the main things you noticed about the predictions?
2. Were the predicted classes surprising to you? Were they useful for our problem?
3. Take home bonus - What did changing the number of workers do?

Here are the important ideas I wanted you to take away

1. The model can only predict classes it was trained on
2. The model picks whichever of its training classes looks most similar to the current image and assigns that class, even when none of them is a good fit
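
Both points follow from how softmax works: it spreads a total confidence of 1 across the fixed list of classes, so some class always comes out on top even when the true answer is not in the list. The sketch below shows this in plain Python. The three-class list and the logit values are made up for illustration and are not taken from the model output above.

```python
import math


def softmax(logits):
    """Convert raw scores into confidences that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical three-class model that has no "horse" class
classes = ["bighorn", "valley", "alp"]
logits = [1.2, 1.1, 0.8]  # made-up scores

confidences = softmax(logits)

# The prediction is simply the class with the highest confidence
best = max(range(len(classes)), key=lambda i: confidences[i])

print(dict(zip(classes, confidences)))
print("Predicted:", classes[best])
```

Notice that the winning confidence is only slightly ahead of the others. A low, nearly flat set of confidences like this is a hint that the image does not match any class well, which is what we saw with the horse.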


## Another ResNet Model

To demonstrate the importance of training data, we are going to run another ResNet18 model, except I trained this model on [Pokemon images](https://huggingface.co/datasets/TheSteve0/pokemon).

I put the model weights file in our shared drive.

To use this model we are going to:
1. Load the model into PyTorch
2. Run the model against our Flickr images
3. Associate the classification labels back to our FiftyOne dataset
4. View the results
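
Step 3 depends on keeping every prediction attached to the right sample, because batches come back as (sample ID, prediction) pairs and the results must be stored in the same order as the dataset. The cell below does this with `sample_ids.index(...)`. As a sketch of the same idea, the example here uses a dictionary lookup instead, which gives the same result and avoids scanning the list for every sample. All IDs, labels, and variable names are made up for illustration.

```python
# Made-up sample IDs standing in for dataset.values("id")
sample_ids = ["id_a", "id_b", "id_c", "id_d"]

# Lookup table from sample ID to its position in the dataset order
position_of = {sample_id: i for i, sample_id in enumerate(sample_ids)}

# One slot per sample, filled in as batches arrive
labels = [None] * len(sample_ids)

# Hypothetical batches of (sample ID, predicted label), deliberately out of order
batches = [
    [("id_c", "Pikachu"), ("id_a", "Eevee")],
    [("id_d", "Onix"), ("id_b", "Mew")],
]

for batch in batches:
    for sample_id, label in batch:
        labels[position_of[sample_id]] = label

print(labels)  # ['Eevee', 'Mew', 'Pikachu', 'Onix']
```

With 300 images the difference in speed is negligible, so this is a matter of style more than performance.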

In [25]:
pokemon_class_labels = {0: "Abra", 1: "Aerodactyl", 2: "Alakazam", 3: "Alolan Sandslash", 4: "Arbok", 5: "Arcanine", 6: "Articuno", 7: "Beedrill", 8: "Bellsprout", 9: "Blastoise", 10: "Bulbasaur", 11: "Butterfree", 12: "Caterpie", 13: "Chansey", 14: "Charizard", 15: "Charmander", 16: "Charmeleon", 17: "Clefable", 18: "Clefairy", 19: "Cloyster", 20: "Cubone", 21: "Dewgong", 22: "Diglett", 23: "Ditto", 24: "Dodrio", 25: "Doduo", 26: "Dragonair", 27: "Dragonite", 28: "Dratini", 29: "Drowzee", 30: "Dugtrio", 31: "Eevee", 32: "Ekans", 33: "Electabuzz", 34: "Electrode", 35: "Exeggcute", 36: "Exeggutor", 37: "Farfetchd", 38: "Fearow", 39: "Flareon", 40: "Gastly", 41: "Gengar", 42: "Geodude", 43: "Gloom", 44: "Golbat", 45: "Goldeen", 46: "Golduck", 47: "Golem", 48: "Graveler", 49: "Grimer", 50: "Growlithe", 51: "Gyarados", 52: "Haunter", 53: "Hitmonchan", 54: "Hitmonlee", 55: "Horsea", 56: "Hypno", 57: "Ivysaur", 58: "Jigglypuff", 59: "Jolteon", 60: "Jynx", 61: "Kabuto", 62: "Kabutops", 63: "Kadabra", 64: "Kakuna", 65: "Kangaskhan", 66: "Kingler", 67: "Koffing", 68: "Krabby", 69: "Lapras", 70: "Lickitung", 71: "Machamp", 72: "Machoke", 73: "Machop", 74: "Magikarp", 75: "Magmar", 76: "Magnemite", 77: "Magneton", 78: "Mankey", 79: "Marowak", 80: "Meowth", 81: "Metapod", 82: "Mew", 83: "Mewtwo", 84: "Moltres", 85: "MrMime", 86: "Muk", 87: "Nidoking", 88: "Nidoqueen", 89: "Nidorina", 90: "Nidorino", 91: "Ninetales", 92: "Oddish", 93: "Omanyte", 94: "Omastar", 95: "Onix", 96: "Paras", 97: "Parasect", 98: "Persian", 99: "Pidgeot", 100: "Pidgeotto", 101: "Pidgey", 102: "Pikachu", 103: "Pinsir", 104: "Poliwag", 105: "Poliwhirl", 106: "Poliwrath", 107: "Wigglytuff", 108: "Zapdos", 109: "Zubat"}


import torch
import torchvision.models as models
import torchvision.transforms.v2 as T
from PIL import Image
import fiftyone as fo
from tqdm.notebook import tqdm
import pickle
import os
from torch.utils.data import Dataset, DataLoader
import torch.amp as amp

# Enable CUDA optimization
torch.backends.cudnn.benchmark = True

# Device configuration
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Load model using pickle
with open('/content/drive/MyDrive/impatient-cv/pokemon-classification-model.pt', 'rb') as f:
    state_dict = pickle.load(f)

# Create a ResNet18 model with no pre-trained weights
model = models.resnet18(weights=None)

# Replace the final layer so it matches the trained model's 150 output classes (150 Pokemon)
model.fc = torch.nn.Linear(model.fc.in_features, 150)

# Check if keys have the nested prefix and remove it
if any(k.startswith('model.model.') for k in state_dict.keys()):
    new_state_dict = {k.replace('model.model.', ''): v for k, v in state_dict.items()}
    state_dict = new_state_dict
elif any(k.startswith('model.') for k in state_dict.keys()):
    new_state_dict = {k.replace('model.', ''): v for k, v in state_dict.items()}
    state_dict = new_state_dict

# Load the state dict into the model
model.load_state_dict(state_dict)
model.to(device)
model.eval()

# pre-processing transforms we did for model training
transform = T.Compose([
    T.ToImage(),
    T.RGB(),
    T.ToDtype(torch.float32, scale=True),
    T.Resize(224),
    T.CenterCrop(224),
])

# The FiftyOne dataset was already loaded earlier in the notebook

# Optional: Define class mapping for Pokemon species (if available)
# class_names = {0: "Pikachu", 1: "Charizard", ...}
class_names = pokemon_class_labels  # Set to None if unavailable

# Custom dataset for parallel loading
class PokemonDataset(Dataset):
    def __init__(self, sample_ids, filepaths, transform=None):
        self.sample_ids = sample_ids
        self.filepaths = filepaths
        self.transform = transform

    def __len__(self):
        return len(self.filepaths)

    def __getitem__(self, idx):
        sample_id = self.sample_ids[idx]
        filepath = self.filepaths[idx]

        image = Image.open(filepath).convert('RGB')
        if self.transform:
            image = self.transform(image)
        return sample_id, image

# Extract sample IDs and filepaths
sample_ids = dataset.values("id")
filepaths = dataset.values("filepath")

# Create dataset and dataloader for parallel processing
pokemon_dataset = PokemonDataset(sample_ids, filepaths, transform)
dataloader = DataLoader(
    pokemon_dataset,
    batch_size=64,  # Larger batch size for GPU efficiency
    num_workers=2,  # Reduced worker count to avoid warnings
    pin_memory=True  # Faster data transfer to GPU
)

# No GradScaler is needed here: gradient scaling only matters during training,
# and this cell only runs inference

# Create a list to store FiftyOne Classification objects
all_classifications = [None] * len(sample_ids)

# Process in batches
with torch.no_grad():
    for batch_ids, images in tqdm(dataloader):
        # Move images to device
        images = images.to(device, non_blocking=True)

        # Run inference with mixed precision (only enabled when running on a GPU)
        with amp.autocast(device.type, enabled=(device.type == "cuda")):
            outputs = model(images)
            probs = torch.nn.functional.softmax(outputs, dim=1)
            confidences, predictions = torch.max(probs, dim=1)

        # Get results from GPU
        predictions = predictions.cpu().numpy()
        confidences = confidences.cpu().numpy()

        # Store results as FiftyOne Classification objects
        for i, sample_id in enumerate(batch_ids):
            # Find index in the original arrays
            idx = sample_ids.index(sample_id)

            pred_idx = int(predictions[i])
            confidence = float(confidences[i])

            # Get class name if mapping exists
            if class_names is not None:
                pred_label = class_names.get(pred_idx, f"Unknown({pred_idx})")
            else:
                # If no class names mapping, use stringified index as label
                pred_label = str(pred_idx)

            # Create a FiftyOne Classification object
            classification = fo.Classification(
                label=pred_label,
                confidence=confidence
            )

            all_classifications[idx] = classification

# Use set_values to update all samples with classification objects in a single batch operation
dataset.set_values("pokemon_classification", all_classifications)

# Save dataset
dataset.save()

print("\nFinished classifying\n\n")

session = fo.launch_app(dataset, auto=False)
session.url

Using device: cuda


  0%|          | 0/5 [00:00<?, ?it/s]


Finished classifying


Session launched. Run `session.show()` to open the App in a cell output.




'https://5151-gpu-t4-s-laa75geddc9l-b.us-west4-1.prod.colab.dev?polling=true'
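
One detail in the cell above deserves a closer look: the saved weights may carry a wrapper prefix such as `model.` on every key, and the keys must match the plain ResNet18 layer names before `load_state_dict` will accept them. The cell uses `str.replace`, which removes the text wherever it appears in a key. The sketch below shows a slightly stricter variant that only strips the text from the start of each key. The helper name and the checkpoint contents are made up for illustration; real values would be tensors.

```python
def strip_prefix(state_dict, prefix):
    """Return a copy of state_dict with `prefix` removed from the start of each key."""
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in state_dict.items()
    }


# Made-up checkpoint keys with placeholder values
checkpoint = {
    "model.conv1.weight": "w0",
    "model.fc.bias": "b0",
}

cleaned = strip_prefix(checkpoint, "model.")
print(sorted(cleaned))  # ['conv1.weight', 'fc.bias']
```

For the ResNet18 layer names used here both approaches give the same keys, so this is only a defensive choice.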

## Wrap up

And with that, we are done with classification. In the next notebook, we are going to reuse these models to demonstrate the importance of data when creating embeddings.

[Embeddings](https://github.com/thesteve0/impatient-computer-vision/blob/main/3_embeddings.ipynb)