<a href="https://colab.research.google.com/github/dajopr/lectures/blob/main/lecture09_deep_learning_workshop.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Workshop: An Introduction to Deep Learning and Computer Vision

Welcome! In this notebook, we will journey from the foundational concepts of deep learning to a hands-on, practical application: fine-tuning a state-of-the-art model for a custom image classification task.

**Goals:**
1.  **Grasp the Basics:** Understand what Deep Learning and Neural Networks are.
2.  **Learn about CNNs:** Discover Convolutional Neural Networks (CNNs), the workhorse of modern computer vision.
3.  **Understand Transfer Learning:** Build a strong theoretical foundation for transfer learning, one of the most powerful techniques in AI.
4.  **Get Hands-On:** Use modern tools like PyTorch and PyTorch Lightning to train and evaluate a real model.

## Part 1: The Theory

### What is Deep Learning?

Deep Learning is a subfield of machine learning inspired by the structure and function of the human brain. It uses artificial **neural networks** with many layers (hence "deep") to learn complex patterns from large amounts of data.

A neural network is made of interconnected nodes, or *neurons*, organized in layers. Each neuron receives inputs, performs a simple computation, and passes the result to the next layer. By adjusting the connections between neurons, the network can learn to map inputs (like an image) to outputs (like a label).

<p align="center">
    <img src="https://media.springernature.com/lw685/springer-static/image/art%3A10.1038%2Fs41377-024-01590-3/MediaObjects/41377_2024_1590_Fig3_HTML.png" alt="A simple neural network">
</p>
<center>From biology to computation: the inspiration for and architecture of neural networks. (a) A biological neuron. (b) The mathematical model of an artificial neuron. (c) A Multi-Layer Perceptron (MLP), a type of deep neural network composed of layers of artificial neurons.</center>
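
To make the "simple computation" inside a single neuron concrete, here is a minimal sketch of one artificial neuron: a weighted sum of its inputs plus a bias, passed through an activation function. The sigmoid activation and all numeric values are assumptions chosen for illustration; real networks learn their weights from data and often use other activations.

```python
import math


def neuron(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, squashed by a sigmoid activation."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical values, chosen only for illustration
inputs = [0.5, -1.0, 2.0]
weights = [0.4, 0.3, -0.2]
bias = 0.1

output = neuron(inputs, weights, bias)
print(f"Neuron output: {output:.3f}")
```

"Adjusting the connections" during training means changing `weights` and `bias` so that the output moves closer to the desired value.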

### Convolutional Neural Networks (CNNs) for Vision

While a standard neural network can work with images, it's not ideal. It would require an enormous number of parameters and would lose the spatial relationships between pixels (i.e., it wouldn't know that pixels close to each other form a shape).

**Convolutional Neural Networks (CNNs)** are a specialized type of neural network designed specifically for visual data. They use three main types of layers to process images efficiently.

#### 1. The Convolutional Layer

This is the core building block of a CNN. Instead of looking at every pixel individually, the convolutional layer uses **filters** (also called kernels) to scan over the image and detect specific features like edges, corners, colors, and textures. As the filter slides across the image, it produces a **feature map** that highlights where those features are present.

<p align="center">
    <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/1/19/2D_Convolution_Animation.gif/220px-2D_Convolution_Animation.gif" alt="A convolution operation">
</p>
<center>Animation of a 3x3 filter sliding over a 5x5 input to produce a 3x3 feature map.</center>

Early layers in the network learn simple features (edges), and deeper layers combine these to learn more complex features (eyes, noses, entire faces).
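
The sliding-filter operation can be sketched in a few lines of NumPy. This is an illustrative implementation without padding and with a stride of 1, not how deep learning libraries implement it internally. The toy image and the hand-written vertical-edge filter are assumptions; in a CNN the filter weights are learned. Note that what deep learning calls "convolution" does not flip the kernel (strictly speaking it is a cross-correlation).

```python
import numpy as np


def conv2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the feature map."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i : i + kh, j : j + kw]
            feature_map[i, j] = np.sum(patch * kernel)
    return feature_map


# 5x5 toy image: dark left part (0), bright right part (1)
image = np.array([[0, 0, 1, 1, 1]] * 5, dtype=float)

# Hand-written vertical-edge filter (a CNN would learn such weights)
kernel = np.array([[-1, 0, 1],
                   [-1, 0, 1],
                   [-1, 0, 1]], dtype=float)

feature_map = conv2d(image, kernel)
print(feature_map)
```

The feature map responds strongly where the filter window covers the dark-to-bright transition and is zero where the window only sees uniform brightness.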

#### 2. The Pooling Layer

The pooling layer's job is to reduce the spatial size of the feature maps, which reduces the number of parameters and computational cost in the network. This also helps make the detected features more robust to changes in their position in the image.

The most common type is **Max Pooling**, which takes a small window and keeps only the maximum value.

<p align="center">
    <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/e/e9/Max_pooling.png/330px-Max_pooling.png" alt="Max pooling operation">
</p>
<center>Max pooling with a 2x2 filter and a stride of 2.</center>
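
As a rough sketch of the same operation in code, the function below applies max pooling with a 2x2 window and a stride of 2 to a small feature map. The input values are made up for illustration.

```python
import numpy as np


def max_pool2d(feature_map, size=2, stride=2):
    """Keep only the maximum value of each `size` x `size` window."""
    h, w = feature_map.shape
    out_h = (h - size) // stride + 1
    out_w = (w - size) // stride + 1
    pooled = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = feature_map[
                i * stride : i * stride + size,
                j * stride : j * stride + size,
            ]
            pooled[i, j] = window.max()
    return pooled


# Hypothetical 4x4 feature map
feature_map = np.array([[1, 3, 2, 4],
                        [5, 6, 1, 2],
                        [7, 2, 9, 0],
                        [1, 8, 3, 4]], dtype=float)

pooled = max_pool2d(feature_map)
print(pooled)
```

The 4x4 input shrinks to 2x2, so later layers have a quarter of the values to process.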

#### 3. The Fully-Connected Layer

After several convolutional and pooling layers have extracted features from the image, the high-level features are flattened into a one-dimensional vector. This vector is then fed into a standard neural network (called a fully-connected layer or dense layer) which acts as a classifier. It takes the learned features and decides which class the image belongs to.
<p align="center">
    <img src="https://external-content.duckduckgo.com/iu/?u=https%3A%2F%2Fi.stack.imgur.com%2FHvPMa.png&f=1&nofb=1&ipt=35b5ab47dabeb3543c1a750b9011c8d5f4de8a85805eea94cdda343b5131787e" alt="A typical CNN architecture">
</p>
<center>The architecture of LeNet-5, one of the earliest successful CNNs. It combines convolutional and pooling layers to extract features, followed by fully-connected layers for classification.</center>
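
The flatten-then-classify step can be sketched as follows. The feature maps and weights here are random stand-ins (an assumption for illustration; in a trained network they are learned), and the softmax at the end is one common way to turn the raw class scores into probabilities.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical output of the last pooling layer: 4 feature maps of size 2x2
feature_maps = rng.normal(size=(4, 2, 2))

# Flatten into a one-dimensional feature vector (4 * 2 * 2 = 16 values)
features = feature_maps.reshape(-1)

# Fully-connected layer: one weight per (feature, class) pair plus one bias per class
num_classes = 3
weights = rng.normal(size=(features.size, num_classes))
bias = np.zeros(num_classes)
logits = features @ weights + bias

# Softmax turns the raw scores into probabilities that sum to 1
exp_scores = np.exp(logits - logits.max())
probabilities = exp_scores / exp_scores.sum()

predicted_class = int(np.argmax(probabilities))
print(f"Probabilities: {probabilities.round(3)}, predicted class: {predicted_class}")
```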

### Beyond Classification: Other Common Computer Vision Tasks

While we are focusing on classification (assigning a single label to an image), it's useful to know about other powerful tasks that CNNs can perform. These tasks differ in the *modality* or structure of their output.

---

#### 1. Image Classification & Localization
* **Question:** What is in the image, and roughly where is it?
* **Output:** A class label and a single **bounding box** that outlines the main object.
* **Example:** "This image contains a 'cat' at these coordinates [x, y, width, height]."

<p align="center">
    <img src="https://miro.medium.com/v2/resize:fit:304/1*uI4AaqoDew9p9YRsVFDZNg.png" alt="Classification and Localization">
    <br>
    <em>Classification and Localization</em>
</p>
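
Bounding boxes are commonly stored either as `[x, y, width, height]`, as in the example above, or as corner coordinates `(xmin, ymin, xmax, ymax)`. The helper functions below are hypothetical and only sketch the conversion between the two; they assume that `(x, y)` is the top-left corner of the box and that all values are in pixels.

```python
def xywh_to_corners(box):
    """Convert (x, y, width, height) to (xmin, ymin, xmax, ymax)."""
    x, y, width, height = box
    return (x, y, x + width, y + height)


def corners_to_xywh(box):
    """Convert (xmin, ymin, xmax, ymax) to (x, y, width, height)."""
    xmin, ymin, xmax, ymax = box
    return (xmin, ymin, xmax - xmin, ymax - ymin)


# Hypothetical 'cat' box: top-left corner at (40, 30), 120 px wide, 80 px tall
cat_box = (40, 30, 120, 80)
corners = xywh_to_corners(cat_box)
print(f"Corner format: {corners}")
```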

---

#### 2. Object Detection
* **Question:** What objects are in the image, and where are they?
* **Output:** Multiple **bounding boxes**, each with its own class label. This is a step up from localization as it can handle multiple objects.
* **Example:** "Found a 'dog' at [box 1], a 'bicycle' at [box 2] and a 'truck' at [box 3]."

<p align="center">
    <img src="https://viso.ai/wp-content/uploads/2021/02/yolo-object-detection.jpg" alt="Object Detection">
    <br>
    <em>Object detection</em>
</p>

---

#### 3. Semantic Segmentation
* **Question:** What is the exact outline of each *category* of object in the image?
* **Output:** A pixel-level mask. Every pixel in the image is assigned a class label (e.g., 'car', 'road', 'sky', 'person'). It does not distinguish between different instances of the same class.
* **Example:** "All these pixels belong to 'dog', these pixels belong to 'grass', these pixels to 'wall'."

#### 4. Instance Segmentation
* **Question:** What is the exact outline of each *individual object* in the image?
* **Output:** A pixel-level mask for each distinct object instance. This is a combination of object detection and semantic segmentation.
* **Example:** "This is 'dog 1', this is 'dog 2'"

#### 5. Panoptic Segmentation
* **Question:** What is the outline of every object and background region in the image, combining both semantic and instance segmentation?
* **Output:** Each pixel is assigned both a class label and, for countable objects, an instance ID. Uncountable regions (like sky, road) are labeled as "stuff," while countable objects (like people, cars) are labeled as "things" with unique IDs.
* **Example:** "These pixels are 'dog 1', these pixels are 'dog 2', these are 'grass' and these are 'wall'"

<p align="center">
    <img src="https://images.prismic.io/encord/89dcc49f-2ce2-4b93-bbcb-7d0c96f2ebda_Instance+Segmentation+-+Encord.png" alt="Instance Segmentation">
    <br>
</p>
<center>
    <em>
        a) Original image &nbsp;&nbsp; b) Semantic segmentation &nbsp;&nbsp; c) Instance segmentation &nbsp;&nbsp; d) Panoptic segmentation
    </em>
</center>
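
The difference between the three segmentation outputs is easiest to see on a tiny example. The sketch below encodes a toy 4x6 "image" with two dogs on grass; the class IDs and the choice to store the panoptic result as a (class ID, instance ID) pair per pixel are assumptions made for illustration, since real datasets use their own encodings.

```python
import numpy as np

# Semantic segmentation: one class ID per pixel (hypothetical IDs: 0 = grass, 1 = dog)
semantic = np.array([
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 1],
    [0, 1, 1, 0, 1, 1],
    [0, 0, 0, 0, 0, 0],
])

# Instance segmentation: one ID per object (0 = no countable object)
instance = np.array([
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 2, 2],
    [0, 1, 1, 0, 2, 2],
    [0, 0, 0, 0, 0, 0],
])

# Panoptic segmentation: every pixel carries (class ID, instance ID)
panoptic = np.stack([semantic, instance], axis=-1)

print("Semantic classes present:", np.unique(semantic).tolist())
print("Dog instances:", np.unique(instance[semantic == 1]).tolist())
print("Pixel (1, 4) as (class, instance):", panoptic[1, 4].tolist())
```

The semantic mask alone cannot tell the two dogs apart, while the instance mask alone says nothing about the grass. The panoptic result keeps both pieces of information.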

## Part 2: Hands-On Transfer Learning

Now that we have a theoretical base, let's put it into practice. We will fine-tune a pre-trained CNN to classify different types of vehicles.


### 1. Setup - Installing Necessary Libraries

In [None]:
!pip install -q torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
!pip install -q pytorch-lightning timm matplotlib

### 2. Data Preparation

To follow along with this workshop, please add the following Google Drive folder to your own Google Drive:

**[Vehicle Images Dataset - Google Drive Link](https://drive.google.com/drive/folders/16aqOtiz56fG_F02aD-95P_68YnHsdngn?usp=sharing)**

1. Click the link above.
2. Click the "Add shortcut to Drive" button at the top.
3. Choose a location in your Drive and confirm.

This will make the dataset accessible for mounting and use in Colab.

In [None]:
# mount google drive folder
from google.colab import drive

drive.mount("/content/drive/")

import shutil

# Copy images to local storage for faster access.
# Assumes the dataset shortcut was added to the top level of "My Drive".
shutil.copytree("/content/drive/MyDrive/vehicles", "/content/vehicles", dirs_exist_ok=True)

In [None]:
import torch
import torchvision
from torchvision import transforms
from torch.utils.data import DataLoader
import pytorch_lightning as pl
import timm
import matplotlib.pyplot as plt
import numpy as np

# Path to the local copy of the dataset (must contain "train" and "val" subfolders)
DATA_DIR = "/content/vehicles"

#### The PyTorch Lightning DataModule

A Lightning DataModule is a standardized way to organize and encapsulate all the code related to data processing in PyTorch Lightning. It separates the data processing logic from the model code, making your project more organized and maintainable.

#### Core Components of a DataModule

##### 1. Data Preparation and Loading
- **prepare_data()**: Downloads data, processes data, etc. Called only on one GPU in distributed settings.
- **setup()**: Creates datasets, splits data, etc. Called on every GPU.
- **DataLoaders**: Methods that return PyTorch DataLoaders for training, validation, and testing.

##### 2. Transforms
Transforms are essential for several reasons:
- **Data Augmentation**: Increases the diversity of your training data by applying random transformations (rotations, flips, color jittering) to help your model generalize better.
- **Normalization**: Standardizes your input data to have similar statistical properties, which helps neural networks converge faster during training.
- **Preprocessing**: Converts raw data into a format suitable for your model (resizing images, converting to tensors, etc.).
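
To illustrate what normalization does numerically, here is a small NumPy sketch using the per-channel mean and standard deviation that the data module in this notebook passes to `transforms.Normalize` (these are the widely used ImageNet statistics). The random image is a stand-in for a real photo. The sketch also checks the trick used in the inference cell of this notebook: applying `Normalize` again with `mean = -mean/std` and `std = 1/std` undoes the normalization.

```python
import numpy as np

# Per-channel statistics used by the data module in this notebook
MEAN = np.array([0.485, 0.456, 0.406])
STD = np.array([0.229, 0.224, 0.225])

rng = np.random.default_rng(seed=0)
image = rng.random((224, 224, 3))  # stand-in RGB image with values in [0, 1]

# Normalize: subtract the mean and divide by the standard deviation, per channel
normalized = (image - MEAN) / STD

# Direct inverse
restored = normalized * STD + MEAN

# Inverse expressed as another "normalize" call: (x - mean') / std'
inv_mean = -MEAN / STD
inv_std = 1.0 / STD
restored_via_normalize = (normalized - inv_mean) / inv_std

print("Max reconstruction error:", np.abs(restored - image).max())
```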

##### 3. Dataset Organization
- Defines how raw data is converted into PyTorch Datasets
- Handles data splitting (train/val/test)
- Manages any data sampling strategies

#### Benefits of Using a DataModule

1. **Reproducibility**: Encapsulates all randomization and processing steps.
2. **Portability**: Makes it easy to share and reuse data pipelines across projects.
3. **Code Organization**: Separates data concerns from model architecture.
4. **Distributed Training Support**: Works seamlessly in multi-GPU environments.

#### Example Structure

```python
class MyDataModule(pl.LightningDataModule):
    def __init__(self, data_dir, batch_size, transforms=None):
        super().__init__()
        self.data_dir = data_dir
        self.batch_size = batch_size
        self.transforms = transforms

    def prepare_data(self):
        # Download data, process data, etc.
        # Only called on one GPU
        pass

    def setup(self, stage=None):
        # Create datasets, split data, etc.
        # Called on every GPU
        pass

    def train_dataloader(self):
        return DataLoader(self.train_dataset, batch_size=self.batch_size)

    def val_dataloader(self):
        return DataLoader(self.val_dataset, batch_size=self.batch_size)

    def test_dataloader(self):
        return DataLoader(self.test_dataset, batch_size=self.batch_size)
```
In [None]:
from torch.utils.data import DataLoader, random_split
from torchvision.datasets import ImageFolder
import pytorch_lightning as pl
import os


class VehicleDataModule(pl.LightningDataModule):
    def __init__(self, data_dir, batch_size=32, num_workers=2, img_size=224):
        super().__init__()
        self.data_dir = data_dir
        self.batch_size = batch_size
        self.num_workers = num_workers
        self.img_size = img_size
        self.save_hyperparameters()
        self.persistent_workers = os.name == "nt"

        self.transform = transforms.Compose(
            [
                transforms.Resize((img_size, img_size)),
                transforms.ToTensor(),
                transforms.Normalize(
                    mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]
                ),
            ]
        )

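        # Training transforms: currently identical to the validation transforms.
        # This is the place to add augmentations (see task 4 in "Train the Model").
        self.train_transform = transforms.Compose(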
            [
                transforms.Resize((img_size, img_size)),
                transforms.ToTensor(),
                transforms.Normalize(
                    mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]
                ),
            ]
        )

    def setup(self, stage=None):
        self.train_dataset = ImageFolder(
            self.data_dir + "/train", transform=self.train_transform
        )

        self.val_dataset = ImageFolder(self.data_dir + "/val", transform=self.transform)
        self.class_names = self.train_dataset.classes
        print(f"Found {len(self.class_names)} classes: {self.class_names}")

    def train_dataloader(self):
        return DataLoader(
            self.train_dataset,
            batch_size=self.batch_size,
            shuffle=True,
            num_workers=self.num_workers,
            persistent_workers=self.persistent_workers,
        )

    def val_dataloader(self):
        return DataLoader(
            self.val_dataset,
            batch_size=self.batch_size,
            num_workers=self.num_workers,
            persistent_workers=self.persistent_workers,
        )


data_module = VehicleDataModule(data_dir=DATA_DIR)
data_module.setup()

### 3. The Theory of Transfer Learning

**What is it?**

Transfer learning is a technique where a model trained on a large, general task (e.g., classifying 1000 types of objects in ImageNet) is reused as the starting point for a different, more specific task (e.g., classifying our vehicle types).

**Why use it?**

1.  **Less Data Needed:** You leverage the 'knowledge' the model has already gained, so you don't need a massive dataset for your specific problem.
2.  **Faster Training:** Training converges much faster because the model's weights are already a great starting point.
3.  **Better Performance:** Pre-trained models are often highly optimized and learn very effective general features.

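A common recipe is to take the pre-trained network, freeze the early layers that detect general features, and train (or *fine-tune*) only the final layers to specialize them for our task. A simpler variant, which the `VehicleClassifier` below uses, replaces the classification head with one sized for our classes and fine-tunes all layers, starting from the pre-trained weights.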
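
Freezing can be sketched in plain PyTorch by switching off gradients for the layers that should keep their pre-trained weights. The tiny network below is a hypothetical stand-in for a real pre-trained backbone, and the 5-class head is an assumption for illustration; this is not the model used later in this notebook.

```python
import torch.nn as nn

# Tiny stand-in for a pre-trained feature extractor (a real backbone is far larger)
backbone = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
)
head = nn.Linear(8, 5)  # new classifier head for 5 hypothetical classes
model = nn.Sequential(backbone, head)

# Freeze the feature extractor: its weights receive no gradient updates
for param in backbone.parameters():
    param.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
frozen = sum(p.numel() for p in model.parameters() if not p.requires_grad)
print(f"Trainable parameters: {trainable}, frozen parameters: {frozen}")
```

Only the head's parameters remain trainable. When building the optimizer, it is then common to pass just the trainable parameters, for example `filter(lambda p: p.requires_grad, model.parameters())`.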

#### The PyTorch Lightning Module

The `LightningModule` organizes our model, training, and validation logic in one clean class.

In [None]:
import torch.nn as nn
import torch.optim as optim
import torchmetrics


class VehicleClassifier(pl.LightningModule):
    def __init__(self, num_classes, learning_rate=1e-3):
        super().__init__()
        self.save_hyperparameters()
        self.model = timm.create_model(
            "efficientnetv2_rw_s.ra2_in1k", pretrained=True, num_classes=num_classes
        )
        self.loss_function = nn.CrossEntropyLoss()

        metrics = torchmetrics.MetricCollection(
            {
                "accuracy": torchmetrics.classification.MulticlassAccuracy(
                    num_classes=num_classes
                ),
                "precision": torchmetrics.classification.MulticlassPrecision(
                    num_classes=num_classes, average="macro"
                ),
                "recall": torchmetrics.classification.MulticlassRecall(
                    num_classes=num_classes, average="macro"
                ),
                "f1": torchmetrics.classification.MulticlassF1Score(
                    num_classes=num_classes, average="macro"
                ),
            }
        )
        self.train_metrics = metrics.clone(prefix="train_")
        self.val_metrics = metrics.clone(prefix="val_")

    def forward(self, x):
        return self.model(x)

    def training_step(self, batch, batch_idx):
        images, labels = batch
        predictions = self.forward(images)
        loss = self.loss_function(predictions, labels)
        self.train_metrics.update(predictions, labels)

        self.log("train_loss", loss, prog_bar=True)
        return loss

    def validation_step(self, batch, batch_idx):
        images, labels = batch
        predictions = self.forward(images)
        loss = self.loss_function(predictions, labels)
        self.val_metrics.update(predictions, labels)
        self.log("val_loss", loss, prog_bar=True)
        return loss

    def on_train_epoch_end(self):
        train_metrics = self.train_metrics.compute()
        self.log_dict(train_metrics, prog_bar=True)
        self.train_metrics.reset()

    def on_validation_epoch_end(self):
        val_metrics = self.val_metrics.compute()
        self.log_dict(val_metrics, prog_bar=True)
        self.val_metrics.reset()

    def configure_optimizers(self):
        optimizer = optim.AdamW(self.parameters(), lr=self.hparams.learning_rate)
        return optimizer


model = VehicleClassifier(num_classes=len(data_module.class_names))

### 4. Train the Model

**Tasks**
1. Configure a new Trainer. In addition to setting max_epochs=5, add the parameter limit_train_batches=3. This tells the trainer to only use the first 3 batches of data for training in each epoch.
2. Train your model for 5 epochs using all the available training data. This will give us a standard performance metric to which we can compare our other experiments.
    - Ensure that you instantiate a new model (with a new name) and trainer so that the training begins from scratch
3. Compare the resulting models using the inference code and the TensorBoard dashboard below.
4. Browse the torchvision transforms gallery (link below) and pick a transform that you think will help the model generalize to unseen images. Add it to the train transforms in the data module above, train a new model with limited data, and check whether it achieves better results than the model trained without the transform.
    - https://docs.pytorch.org/vision/stable/auto_examples/transforms/plot_transforms_illustrations.html#sphx-glr-auto-examples-transforms-plot-transforms-illustrations-py


In [None]:
model = VehicleClassifier(num_classes=len(data_module.class_names))
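# Without max_epochs, the Trainer would keep training far longer than this workshop needs.
# For task 1, create a new Trainer that additionally sets limit_train_batches=3.
trainer = pl.Trainer(max_epochs=5)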

print("Starting model fine-tuning...")
trainer.fit(model, data_module)
print("Training finished!")

print(f"Best model path: {trainer.checkpoint_callback.best_model_path}")

best_model = VehicleClassifier.load_from_checkpoint(
    trainer.checkpoint_callback.best_model_path
)
best_model.eval()
best_model.to("cuda" if torch.cuda.is_available() else "cpu")

### 5. Making Predictions (Inference)

In [None]:
val_loader = data_module.val_dataloader()
images, labels = next(iter(val_loader))

with torch.no_grad():
    logits = best_model(images.to(best_model.device))
    preds = torch.argmax(logits, dim=1)

class_names = data_module.class_names

inv_normalize = transforms.Normalize(
    mean=[-0.485 / 0.229, -0.456 / 0.224, -0.406 / 0.225],
    std=[1 / 0.229, 1 / 0.224, 1 / 0.225],
)

plt.figure(figsize=(15, 10))
for i in range(16):
    if i >= len(images):
        break
    ax = plt.subplot(4, 4, i + 1)
    # Undo the normalization and clamp to [0, 1] so matplotlib can display the image
    img = inv_normalize(images[i]).clamp(0, 1)
    img = img.permute(1, 2, 0)

    predicted_class = class_names[preds[i]]
    true_class = class_names[labels[i]]
    title_color = "g" if predicted_class == true_class else "r"

    plt.imshow(img.cpu().numpy())
    plt.title(f"True: {true_class}\nPred: {predicted_class}", color=title_color)
    plt.axis("off")

plt.tight_layout()
plt.show()

### 5a. Training dashboard

In [None]:
%load_ext tensorboard
%tensorboard --logdir .

### 6. Conclusion

Congratulations! You have successfully walked through a complete, end-to-end deep learning pipeline. You have:

- Learned the **theory** behind Deep Learning and CNNs.
- Understood the power and process of **Transfer Learning**.
- Organized a data pipeline using a **PyTorch Lightning DataModule**.
- Built and trained a model using a **LightningModule**, leveraging a state-of-the-art **EfficientNetV2** architecture.
- Used the trained model to **make predictions** on unseen data and visualize the results.

This workshop demonstrates a fundamental and highly effective workflow in modern computer vision. The skills you've learned here are directly applicable to a wide range of real-world problems.

# Workshop: Exploring Computer Vision Tasks with Hugging Face 🤗

Welcome! In the previous workshop, we saw how to fine-tune a model for a specific classification task. But what if we want to perform other tasks, like finding multiple objects in an image or upscaling a low-resolution picture, without having to train a new model from scratch?

This is where the Hugging Face Hub and its `pipeline` API come in. We can use powerful, pre-trained models for a huge variety of tasks with just a few lines of code.

**Goals:**
1.  **Understand Zero-Shot Learning:** See how a model can perform tasks it wasn't explicitly trained for, like finding specific parts of a car.
2.  **Perform Object Detection:** Use a zero-shot model to identify and locate different parts of a vehicle in an image.
3.  **Perform Image Segmentation:** Learn the difference between bounding boxes and pixel-level masks by using a model to segment an image.
4.  **Perform Image-to-Image Transformation:** Upscale a low-resolution image with a pre-trained super-resolution model.

## 1. Setup - Installing Necessary Libraries

First, we need to install the libraries. `transformers` is the core Hugging Face library, `timm` is needed for some vision model backbones, and `Pillow` is used for image manipulation.

In [None]:
!pip install -q transformers torch timm Pillow

## 2. Data Preparation

Let's use the same vehicle dataset from our last session. This will give us a consistent set of images to work with. We'll reuse the local copy and pick a random sample image to use in our exercises.

In [None]:
import zipfile
from PIL import Image, ImageDraw
import random
import os
import glob

try:
    DATA_DIR = "vehicles"
    train_dir = os.path.join(DATA_DIR, "train")
    classes = list(os.listdir(train_dir))
    image_files = glob.glob(os.path.join(train_dir, "**", "*.jpg"), recursive=True)
    if not image_files:
        raise FileNotFoundError
    SAMPLE_IMAGE_PATH = random.choice(image_files)
    print(f"Using sample image: {SAMPLE_IMAGE_PATH}")
except FileNotFoundError:
    print(
        "Could not find sample images. Please check the dataset path and check drive mounting above."
    )
    SAMPLE_IMAGE_PATH = None

## Task 1: Zero-Shot Object Detection

**Concept:** Standard object detectors are trained on a fixed set of classes (e.g., the 80 classes in the COCO dataset). **Zero-shot** detectors are different. They use a vision-language model (like CLIP) that understands the relationship between images and text. This allows us to provide arbitrary text labels at inference time, and the model can find objects matching those descriptions, even if it never saw that specific label during training!

Let's use this to find not just cars, but specific *parts* of a vehicle.
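
The core idea of matching images against arbitrary text can be sketched with toy vectors. This is a strong simplification, not the detector's actual implementation: it assumes that an image region and each text label have already been mapped into a shared embedding space, and it simply picks the label whose embedding is most similar. All embeddings below are made-up 3-dimensional vectors, whereas real models use hundreds of dimensions.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1 = same direction)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Hypothetical embedding of one image region
region_embedding = np.array([0.9, 0.1, 0.2])

# Hypothetical embeddings of the candidate text labels
text_embeddings = {
    "wheel": np.array([1.0, 0.0, 0.1]),
    "windshield": np.array([0.0, 1.0, 0.2]),
    "license plate": np.array([0.1, 0.2, 1.0]),
}

scores = {
    label: cosine_similarity(region_embedding, embedding)
    for label, embedding in text_embeddings.items()
}
best_label = max(scores, key=scores.get)

for label, score in scores.items():
    print(f"{label}: {score:.3f}")
print(f"Best match: {best_label}")
```

Because the labels are just text that gets embedded at inference time, new labels can be added without retraining the model.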

**Your tasks**
1. Change the candidate labels in the code below to include some common car parts.
2. Try the detector on multiple images and look at the predictions.
    - Are detections missing? Are things being wrongly detected?
3. Adjust the score threshold to a reasonable value.
4. (Optional) Try out the prediction pipeline on another task using your own images.
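
To get a feeling for the score threshold before running the detector, the sketch below filters a list of predictions at different thresholds. The predictions are hypothetical values written by hand in the same shape that the detection cell below iterates over (`score`, `label`, and a `box` with corner coordinates); `filter_predictions` is a helper made up for this illustration.

```python
def filter_predictions(predictions, threshold):
    """Keep only the predictions whose score exceeds `threshold`."""
    return [p for p in predictions if p["score"] > threshold]


# Hypothetical detector output
predictions = [
    {"score": 0.42, "label": "wheel",
     "box": {"xmin": 30, "ymin": 120, "xmax": 90, "ymax": 180}},
    {"score": 0.08, "label": "wheel",
     "box": {"xmin": 200, "ymin": 10, "xmax": 230, "ymax": 40}},
    {"score": 0.21, "label": "headlight",
     "box": {"xmin": 15, "ymin": 90, "xmax": 45, "ymax": 110}},
]

for threshold in (0.05, 0.15, 0.30):
    kept = filter_predictions(predictions, threshold)
    labels = [p["label"] for p in kept]
    print(f"threshold={threshold:.2f}: kept {len(kept)} -> {labels}")
```

A low threshold keeps more boxes, including likely false positives, while a high threshold can drop correct but less confident detections.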

In [None]:
from transformers import pipeline

# Load the zero-shot object detection pipeline
print("Loading zero-shot object detection pipeline...")
detector = pipeline(
    model="google/owlvit-base-patch32", task="zero-shot-object-detection"
)
print("Pipeline loaded!")

# Open our sample image
if SAMPLE_IMAGE_PATH:
    image = Image.open(SAMPLE_IMAGE_PATH)
    image_draw = image.copy()
    # Define candidate labels. The model will search for these in the image.
    candidate_labels = ["car"]

    # Run the detection
    predictions = detector(image, candidate_labels=candidate_labels)

    # --- Visualize the results ---
    draw = ImageDraw.Draw(image_draw)

    # Create a color map for labels
    label_colors = {
        label: (
            random.randint(60, 255),
            random.randint(60, 255),
            random.randint(60, 255),
        )
        for label in candidate_labels
    }

    for prediction in predictions:
        box = prediction["box"]
        label = prediction["label"]
        score = prediction["score"]

        if score > 0.15:  # Only draw boxes with a reasonable confidence score
            xmin, ymin, xmax, ymax = box["xmin"], box["ymin"], box["xmax"], box["ymax"]
            color = label_colors[label]

            # Draw the bounding box
            draw.rectangle((xmin, ymin, xmax, ymax), outline=color, width=3)

            # Draw the label and score
            text = f"{label}: {score:.2f}"
            # We don't have a font file in this environment, so we'll use the default
            draw.text((xmin, ymin - 10), text, fill=color)

    print("Displaying detection results:")
    display(image_draw)

## Task 2: Image Segmentation

**Concept:** While object detection draws a box *around* an object, segmentation goes a step further. It classifies every single **pixel** in the image. This gives us a precise, pixel-perfect outline of each object.

We will use a **panoptic segmentation** model, which is a hybrid: it finds individual instances of objects (like object detection) and also classifies background stuff (like semantic segmentation).
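
The difference between a box and a mask can be quantified on a toy example. The sketch below derives the tightest bounding box from a hand-written binary mask and compares how many pixels each covers. The mask and the helper `mask_to_box` are assumptions for illustration, and the box uses inclusive pixel coordinates.

```python
import numpy as np


def mask_to_box(mask):
    """Return (xmin, ymin, xmax, ymax) of the tightest box around the True pixels."""
    rows = np.where(np.any(mask, axis=1))[0]
    cols = np.where(np.any(mask, axis=0))[0]
    return int(cols[0]), int(rows[0]), int(cols[-1]), int(rows[-1])


# Toy binary mask of a diagonal object
mask = np.array([
    [0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0],
], dtype=bool)

box = mask_to_box(mask)
box_area = (box[2] - box[0] + 1) * (box[3] - box[1] + 1)
mask_area = int(mask.sum())

print(f"Bounding box: {box}")
print(f"Pixels inside the box: {box_area}, pixels inside the mask: {mask_area}")
```

For this diagonal shape, fewer than half of the pixels inside the box actually belong to the object, which is exactly the information a mask preserves and a box discards.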

**Your tasks**
1. Select a model to use for panoptic segmentation.
    - Go to Hugging Face's Models page (https://huggingface.co/models) to search for models.
    - You can use Facebook's DETR panoptic model with a ResNet-50 backbone.

In [None]:
# Load the image segmentation pipeline
print("Loading image segmentation pipeline...")

# Placeholder: replace with the Hub ID of the model you selected in the task above
model_name = "model_name"

segmenter = pipeline(model=model_name, task="image-segmentation")
print("Pipeline loaded!")

if SAMPLE_IMAGE_PATH:
    image = Image.open(SAMPLE_IMAGE_PATH)
    predictions = segmenter(image)

    # --- Visualize the results ---
    # The pipeline returns a list of segments, each with a "label" and a binary "mask" image.
    # We overlay every mask on the original image in a random, semi-transparent color.

    def draw_segmentation(original_image, segmentation_results):
        # Create a blank RGBA image to draw the colored masks on
        mask_image = Image.new("RGBA", original_image.size, (0, 0, 0, 0))
        draw = ImageDraw.Draw(mask_image)

        for segment in segmentation_results:
            mask = segment["mask"]
            label = segment["label"]

            # Generate a random color for this segment's label
            color = (
                random.randint(60, 255),
                random.randint(60, 255),
                random.randint(60, 255),
                150,
            )  # RGBA with transparency

            # The mask is a binary image. We can use it to color our overlay.
            draw.bitmap((0, 0), mask, fill=color)

        # Composite the original image with the colored masks
        return Image.alpha_composite(original_image.convert("RGBA"), mask_image)

    segmented_image = draw_segmentation(image, predictions)
    print("Displaying segmentation results:")
    display(segmented_image)

## Task 3: Image Enhancement with Super-Resolution

**Concept:** Super-resolution is a technique that increases the resolution of an image while preserving and enhancing details. We'll use a pre-trained super-resolution model to upscale one of our lower-resolution images.

**Your tasks**
1. Select a smaller image from your dataset.
2. Use the existing Swin2SR pipeline to upscale the image.
3. Compare the original and upscaled versions.

In [None]:
import torch
import requests
import skimage
import matplotlib.pyplot as plt

# Load the image-to-image pipeline
# Note: This model is larger and may take more time to download.
# We also specify torch_dtype=torch.float16 for memory efficiency if a GPU is available.
device = "cuda" if torch.cuda.is_available() else "cpu"
img2img = pipeline(
    "image-to-image",
    model="caidas/swin2SR-lightweight-x2-64",
    torch_dtype=torch.float16 if device == "cuda" else torch.float32,
    device=device,
)

image = skimage.io.imread(
    "https://external-content.duckduckgo.com/iu/?u=https%3A%2F%2Ftse2.explicit.bing.net%2Fth%2Fid%2FOIP.gw2_0gNMn7S2TMe38z5aRwAAAA%3Fr%3D0%26pid%3DApi&f=1&ipt=3f1ca19f9f3d3ba31ad4ca4cb8b0ab575eccba80dade207783e49d555d673628&ipo=images"
)
image = Image.fromarray(image)
if SAMPLE_IMAGE_PATH:
    # Run the transformation
    # Placeholder for task 2: pass `image` through the `img2img` pipeline and store the result here
    transformed_image = image

    # --- Visualize the results ---
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 6))
    ax1.imshow(image)
    ax1.set_title("Original Image")
    ax1.axis("off")

    ax2.imshow(transformed_image)
    ax2.set_title("Transformed: 'Upscaled'")
    ax2.axis("off")

    plt.show()

## Conclusion

In this workshop, we barely scratched the surface of what's possible with pre-trained models from the Hugging Face Hub. We saw how to:

- **Detect custom objects** with a zero-shot detector.
- **Create pixel-perfect masks** with a segmentation model.
- **Upscale images** with a super-resolution model.

The key takeaway is that you don't always need to train a model from scratch. By leveraging the work of the broader community, you can build powerful and exciting computer vision applications quickly and effectively.

# Optional exercise – Fire and Smoke Detection

In this optional exercise, you will use a pre-trained model from the Hugging Face Hub to perform a real-world task: detecting fire and smoke in a dataset of images. This demonstrates the power of transfer learning and using existing models for new applications without having to train from scratch.

- Use some of the images found in this drive:
    https://drive.google.com/drive/folders/1V61wHR-CjFRaV2RXJfDw0ZzJ5zQJFvpW?usp=sharing
- Evaluate the performance of the object detection model
