# <center> <h1> 🧶 Combining `Lightning` and `ClearML` in CV tasks 🖼</h1> </center>

### Notebook contents
<img src='../images/CV.webp' align="right" width="508" height="428" >
<br>

<p><font size="3" face="Arial" font-size="large"><ul type="square">
    
<li><a href="#p1">🧐 The combination in action!</a></li>
<li><a href="#p2">🎨 Image classification</a></li>
<li><a href="#p7">🎛 Fine-tuning a pretrained model</a></li>
<li><a href="#p3">🧶 The combination in semantic segmentation 🧩</a></li>
<li><a href="#p6">🧸 Conclusions ✅ </a></li>


    
</ul></font></p>

### 🧑‍🎓 In this lesson we look at how **PyTorch Lightning** combined with **ClearML** works in CV tasks. 🪄

<div class="alert alert-info">

The combination of **ClearML** and **PyTorch Lightning** works well for tasks in the CV domain.

* It is convenient for tracking training progress, especially when training on remote servers rather than in notebooks.
* Fine-tuning a pretrained backbone is easy to set up.
* It is convenient for tracking the training progress of generative models (GANs, etc.).
* Plus all the advantages covered in the previous lesson.

# <center id="p1">  🧐 The combination in action!</center>

In [None]:
#!pip install clearml tensorboard lightning torchvision torchmetrics -q

In [1]:
import os
import logging
import pandas as pd

import torch
from lightning import LightningModule, Trainer, seed_everything
from lightning.pytorch.callbacks import ModelCheckpoint, LearningRateMonitor
from torch.utils.data import DataLoader
from torch import nn, optim
import torch.nn.functional as F
import torchvision.models as models
import torchvision.transforms as transforms
import torchvision.datasets as datasets
from torchmetrics import Accuracy

from lightning.pytorch.loggers import TensorBoardLogger

from clearml import Task

In [4]:
#from getpass import getpass
# Enter the keys you obtained, one at a time, in the prompt that appears (no need to change the code)
#access_key = getpass(prompt="Enter the API Access token: ")
#secret_key = getpass(prompt="Enter the API Secret token: ")

Enter the API Access token:  ········
Enter the API Secret token:  ········


In [5]:
# %%capture
# #  Do not display your API keys
# %env CLEARML_WEB_HOST=https://app.clear.ml/
# %env CLEARML_API_HOST=https://api.clear.ml
# %env CLEARML_FILES_HOST=https://files.clear.ml

# %env CLEARML_API_ACCESS_KEY=$access_key
# %env CLEARML_API_SECRET_KEY=$secret_key

# <center id="p2"> 🎨 Image classification </center>

<div class="alert alert-info">

Let's write and train our own network on the `CIFAR10` dataset!

In [2]:
# Initialize the task in ClearML
task = Task.init(
    project_name="Image Classification",
    task_name="CIFAR10 Training with Image Logging",
    task_type=Task.TaskTypes.training
)

ClearML Task: created new task id=bc5186058e6e4411aa63848e1a1bb9f5
2025-02-17 16:16:15,713 - clearml.Task - INFO - Storing jupyter notebook directly as code
ClearML results page: https://app.clear.ml/projects/0cb5f71423834f7ba18c3a50e5989fd3/experiments/bc5186058e6e4411aa63848e1a1bb9f5/output/log


In [3]:
config = {
    "num_epochs": 3,
    "batch_size": 24,
    "learning_rate": 0.001,
    "dropout_rate": 0.25,
    "num_images_to_log": 5,  # Number of images to log
}
config = task.connect(config)  # Register the hyperparameters in ClearML
print("Hyperparameters:", config)

Hyperparameters: {'num_epochs': 3, 'batch_size': 24, 'learning_rate': 0.001, 'dropout_rate': 0.25, 'num_images_to_log': 5}


We use the CIFAR10 dataset.

In [4]:
# CIFAR-10 classes
classes = ('plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck')

# Data preparation
transform = transforms.Compose([transforms.ToTensor()])
train_dataset = datasets.CIFAR10(root="./data", train=True, download=True, transform=transform)
test_dataset = datasets.CIFAR10(root="./data", train=False, download=True, transform=transform)

train_loader = DataLoader(train_dataset, batch_size=config["batch_size"], shuffle=True, num_workers=2)
val_loader = DataLoader(test_dataset, batch_size=config["batch_size"], shuffle=False, num_workers=2)

Files already downloaded and verified
Files already downloaded and verified


## Model

In [5]:
class CIFAR10Model(LightningModule):
    def __init__(self, num_classes=10, dropout_rate=0.25, learning_rate=0.001):
        super().__init__()
        self.save_hyperparameters()
        self.model = nn.Sequential(
            nn.Conv2d(3, 6, kernel_size=3),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Conv2d(6, 16, kernel_size=3),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Flatten(),
            nn.Linear(16 * 6 * 6, 120),
            nn.ReLU(),
            nn.Linear(120, 84),
            nn.ReLU(),
            nn.Dropout(p=dropout_rate),
            nn.Linear(84, num_classes),
        )
        self.loss_fn = nn.CrossEntropyLoss()
        self.accuracy = Accuracy(task="multiclass", num_classes=num_classes, top_k=1)

    def forward(self, x):
        return self.model(x)

    def training_step(self, batch, batch_idx):
        inputs, labels = batch
        outputs = self(inputs)
        loss = self.loss_fn(outputs, labels)
        self.log("train_loss", loss, on_step=True, on_epoch=True)
        task.get_logger().report_scalar("Loss", "train", loss.item(), self.global_step)
        return loss

    def validation_step(self, batch, batch_idx):
        inputs, labels = batch
        outputs = self(inputs)
        loss = self.loss_fn(outputs, labels)
        acc = self.accuracy(outputs, labels)
        self.log("val_loss", loss, on_step=False, on_epoch=True)
        self.log("val_acc", acc, on_step=False, on_epoch=True)
        task.get_logger().report_scalar("Loss", "val", loss.item(), self.global_step)
        task.get_logger().report_scalar("Accuracy", "val", acc.item(), self.global_step)

        # Image logging
        if batch_idx == 0:  # Log images from the first batch only
            _, predicted = outputs.max(1)
            for i in range(min(len(inputs), config["num_images_to_log"])):
                img = transforms.ToPILImage()(inputs[i].cpu())
                pred_label = predicted[i].item()
                true_label = labels[i].item()
                title = f"True: {classes[true_label]}, Pred: {classes[pred_label]}"
                task.get_logger().report_image(
                    title=title,
                    series=f"Epoch {self.current_epoch}",
                    iteration=self.global_step,
                    image=img,
                )

    def configure_optimizers(self):
        return optim.SGD(self.parameters(), lr=self.hparams.learning_rate, momentum=0.9)

In [6]:
# Callbacks for monitoring
callbacks = [
    ModelCheckpoint(monitor="val_acc", mode="max", save_top_k=1),
    LearningRateMonitor(logging_interval="step"),
]

In [7]:
seed_everything(42)

Seed set to 42


42

In [8]:
trainer = Trainer(
    max_epochs=config["num_epochs"],
    accelerator="gpu" if torch.cuda.is_available() else "cpu",
    devices="auto",
    callbacks=callbacks,
    log_every_n_steps=50,
)

Trainer will use only 1 of 2 GPUs because it is running inside an interactive / notebook environment. You may try to set `Trainer(devices=2)` but please note that multi-GPU inside interactive / notebook environments is considered experimental and unstable. Your mileage may vary.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs


In [9]:
# Start training
model = CIFAR10Model(
    dropout_rate=config["dropout_rate"],
    learning_rate=config["learning_rate"]
)
trainer.fit(model, train_loader, val_loader)

You are using a CUDA device ('NVIDIA GeForce RTX 3090') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]

  | Name     | Type               | Params | Mode 
--------------------------------------------------------
0 | model    | Sequential         | 81.3 K | train
1 | loss_fn  | CrossEntropyLoss   | 0      | train
2 | accuracy | MulticlassAccuracy | 0      | train
--------------------------------------------------------
81.3 K    Trainable params
0         Non-trainable params
81.3 K    Total params
0.325     Total estimated model params size (MB)
16        Modules in train mode
0         Modules in eval mode


Sanity Checking: |          | 0/? [00:00<?, ?it/s]

Training: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

`Trainer.fit` stopped: `max_epochs=3` reached.
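As a sanity check on the architecture above, the flattened size `16 * 6 * 6` and the 81.3 K parameter count from the model summary can be reproduced by hand. The following is a minimal sketch in plain Python, assuming 32×32 CIFAR-10 inputs and unpadded, stride-1 3×3 convolutions as in `CIFAR10Model`. The helper names (`conv_out`, `conv_params`, `linear_params`) are illustrative and not part of any library.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial size after a convolution or pooling layer (floor division, as PyTorch does)."""
    return (size + 2 * padding - kernel) // stride + 1


def conv_params(c_in, c_out, kernel):
    """Weights plus biases of a Conv2d layer with a square kernel."""
    return c_in * c_out * kernel * kernel + c_out


def linear_params(n_in, n_out):
    """Weights plus biases of a Linear layer."""
    return n_in * n_out + n_out


size = 32                          # CIFAR-10 images are 32x32
size = conv_out(size, 3)           # Conv2d(3, 6, kernel_size=3)
size = conv_out(size, 2, stride=2) # MaxPool2d(2, 2)
size = conv_out(size, 3)           # Conv2d(6, 16, kernel_size=3)
size = conv_out(size, 2, stride=2) # MaxPool2d(2, 2)
flat_features = 16 * size * size   # input size of the first Linear layer

total_params = (
    conv_params(3, 6, 3)
    + conv_params(6, 16, 3)
    + linear_params(flat_features, 120)
    + linear_params(120, 84)
    + linear_params(84, 10)
)
print(flat_features, total_params)
```

If the input resolution or a kernel size changes, re-running this calculation gives the new `in_features` for the first `nn.Linear` layer.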


In [3]:
print('Task ID number is: {}'.format(task.id))
task.close()

Task ID number is: bc5186058e6e4411aa63848e1a1bb9f5


# <center id="p7"> 🎛 Fine-tuning a pretrained model </center>

In [12]:
# ✅ Initialize the ClearML Task
task = Task.init(
    project_name="Image Classification",
    task_name="Fine-tuning ResNet18 on CIFAR10",
    task_type=Task.TaskTypes.training
)

ClearML Task: created new task id=bc209c2c8a22429588be10ad2e145a63
ClearML results page: https://app.clear.ml/projects/0cb5f71423834f7ba18c3a50e5989fd3/experiments/bc209c2c8a22429588be10ad2e145a63/output/log
ClearML Monitor: Could not detect iteration reporting, falling back to iterations as seconds-from-start


In [13]:
# ✅ Hyperparameters (editable in ClearML)
config = {
    "num_epochs": 10,
    "batch_size": 32,
    "learning_rate": 1e-3,
    "fine_tune_start_epoch": 5,  # Epoch at which the backbone is unfrozen
    "num_classes": 10,
    "num_images_to_log": 5,  # Number of images to log to ClearML
}
config = task.connect(config)
print("Hyperparameters:", config)

Hyperparameters: {'num_epochs': 10, 'batch_size': 32, 'learning_rate': 0.001, 'fine_tune_start_epoch': 5, 'num_classes': 10, 'num_images_to_log': 5}


In [19]:
# ✅ Lightning module for training ResNet18
class FineTuneResNet(LightningModule):
    def __init__(self, num_classes=10, learning_rate=1e-3, fine_tune_start_epoch=5):
        super().__init__()
        self.save_hyperparameters()
        
        # Load a pretrained ResNet18 (`weights=` replaces the deprecated `pretrained=True`)
        self.model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
        
        # Replace the last layer with one for 10 classes (CIFAR-10)
        in_features = self.model.fc.in_features
        self.model.fc = nn.Linear(in_features, num_classes)
        
        # Freeze the backbone weights (everything except fc)
        for param in self.model.parameters():
            param.requires_grad = False
        for param in self.model.fc.parameters():
            param.requires_grad = True
        
        self.loss_fn = nn.CrossEntropyLoss()

    def forward(self, x):
        return self.model(x)

    def training_step(self, batch, batch_idx):
        inputs, labels = batch
        outputs = self(inputs)
        loss = self.loss_fn(outputs, labels)
        self.log("train_loss", loss, on_step=True, on_epoch=True)

        # Logging to ClearML
        task.get_logger().report_scalar("Loss", "train", loss.item(), self.global_step)

        return loss

    def validation_step(self, batch, batch_idx):
        inputs, labels = batch
        outputs = self(inputs)
        loss = self.loss_fn(outputs, labels)
        acc = (outputs.argmax(dim=1) == labels).float().mean()
        
        self.log("val_loss", loss, on_epoch=True)
        self.log("val_acc", acc, on_epoch=True)

        task.get_logger().report_scalar("Loss", "val", loss.item(), self.global_step)
        task.get_logger().report_scalar("Accuracy", "val", acc.item(), self.global_step)

        # Logging images to ClearML
        if batch_idx == 0:
            _, predicted = outputs.max(1)
            for i in range(min(len(inputs), config["num_images_to_log"])):
                img = transforms.ToPILImage()(inputs[i].cpu())
                pred_label = predicted[i].item()
                true_label = labels[i].item()
                title = f"True: {classes[true_label]}, Pred: {classes[pred_label]}"
                task.get_logger().report_image(
                    title=title,
                    series=f"Epoch {self.current_epoch}",
                    iteration=self.global_step,
                    image=img,
                )

    def configure_optimizers(self):
        # Pass all parameters: Adam skips frozen ones (their grad is None),
        # so the backbone starts updating as soon as it is unfrozen
        return optim.Adam(self.parameters(), lr=self.hparams.learning_rate)

    def on_train_epoch_start(self):
        # Unfreeze the backbone once, at epoch `fine_tune_start_epoch`
        if self.current_epoch == self.hparams.fine_tune_start_epoch:
            for param in self.model.parameters():
                param.requires_grad = True
            print(f"🔓 Unfreezing all model layers at epoch {self.current_epoch}")

<div class="alert alert-info">
    
**🔹 What does the code do?**
1. Loads a pretrained `ResNet18` from `torchvision.models`
2. Replaces the last layer with one for 10 classes (CIFAR-10)
3. Freezes the backbone weights (for the first `fine_tune_start_epoch` epochs)
4. Unfreezes the whole model after `fine_tune_start_epoch`
5. Logs metrics and images to `ClearML`
6. Saves the best model by `val_acc`

**🔹 How does fine-tuning work?**
* During the first `fine_tune_start_epoch` epochs, only the last layer (fc) is trained.
* Then the whole `ResNet18` is unfrozen and the entire model is trained.
* 🔥 This is a standard approach in `Transfer Learning` that usually makes `fine-tuning` more efficient and more stable! 🚀

**🔹 What is logged to ClearML?**

 ✅ Metrics (`train_loss`, `val_loss`, `val_acc`)
 
 ✅ Debug images with predictions
 
 ✅ Experiment hyperparameters
 
 ✅ The model with the best accuracy
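The two-phase schedule described above can be illustrated on a toy model. This is a minimal sketch, assuming a small `nn.Linear` layer as a stand-in for the pretrained backbone and another as the new head; the names `backbone`, `head` and `train_step` are hypothetical and unrelated to the `FineTuneResNet` class. It relies on the fact that `Adam` skips parameters whose gradient is `None`, so a single optimizer over all parameters can serve both phases.

```python
import torch
from torch import nn, optim

torch.manual_seed(0)
backbone = nn.Linear(4, 8)   # stand-in for the pretrained backbone
head = nn.Linear(8, 2)       # stand-in for the new `fc` layer
model = nn.Sequential(backbone, nn.ReLU(), head)

# Phase 1: freeze the backbone, train only the head
for p in backbone.parameters():
    p.requires_grad = False

# One optimizer over all parameters; frozen ones have no gradient and are skipped
optimizer = optim.Adam(model.parameters(), lr=1e-2)
x, y = torch.randn(16, 4), torch.randint(0, 2, (16,))


def train_step():
    optimizer.zero_grad()
    loss = nn.functional.cross_entropy(model(x), y)
    loss.backward()
    optimizer.step()


initial_weight = backbone.weight.detach().clone()
train_step()
backbone_unchanged_while_frozen = torch.equal(initial_weight, backbone.weight)

# Phase 2: unfreeze the backbone, the same optimizer now updates it too
for p in backbone.parameters():
    p.requires_grad = True
train_step()
backbone_changed_after_unfreeze = not torch.equal(initial_weight, backbone.weight)

print(backbone_unchanged_while_frozen, backbone_changed_after_unfreeze)
```

If the optimizer were built from the head parameters only, unfreezing would compute gradients for the backbone but never apply them, so the optimizer must know about every parameter that is trained in phase 2.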

In [20]:
model = FineTuneResNet(
    num_classes=config["num_classes"],
    learning_rate=config["learning_rate"],
    fine_tune_start_epoch=config["fine_tune_start_epoch"]
)

callbacks = [
    ModelCheckpoint(monitor="val_acc", mode="max", save_top_k=1),
    LearningRateMonitor(logging_interval="step"),
]

Connecting multiple input models with the same name: `resnet18-f37072fd`. This might result in the wrong model being used when executing remotely


In [21]:
# ✅ Training setup
trainer = Trainer(
    max_epochs=config["num_epochs"],
    accelerator="gpu" if torch.cuda.is_available() else "cpu",
    devices="auto",
    callbacks=callbacks,
    log_every_n_steps=50,
)

# ✅ Start training
trainer.fit(model, train_loader, val_loader)

Trainer will use only 1 of 2 GPUs because it is running inside an interactive / notebook environment. You may try to set `Trainer(devices=2)` but please note that multi-GPU inside interactive / notebook environments is considered experimental and unstable. Your mileage may vary.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]

  | Name    | Type             | Params | Mode 
-----------------------------------------------------
0 | model   | ResNet           | 11.2 M | train
1 | loss_fn | CrossEntropyLoss | 0      | train
-----------------------------------------------------
5.1 K     Trainable params
11.2 M    Non-trainable params
11.2 M    Total params
44.727    Total estimated model params size (MB)
69        Modules in train mode
0         Modules in eval mode


Sanity Checking: |          | 0/? [00:00<?, ?it/s]

Training: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

`Trainer.fit` stopped: `max_epochs=10` reached.


In [22]:
task.close()

### A look at the validation images
You can see the model gradually getting "smarter": it makes fewer mistakes when predicting classes.

<img src='../images/cif10.png'>

# <center id="p3"> 🧶 The combination in semantic segmentation 🧩</center>

<div class="alert alert-info">

Let's walk through a segmentation example using `PyTorch Lightning` and `ClearML`. We will use the `FCN-ResNet50` model from `torchvision.models.segmentation` and train it on the `VOC Segmentation Dataset (Pascal VOC 2012)`.

The `VOC Segmentation Dataset (Pascal VOC 2012)` contains images with objects of 20 classes plus 1 background class.

We also take `JaccardIndex` from `torchmetrics` to compute the `IoU` metric.

The result is a strong combination: **⚡ Lightning + ClearML + torchmetrics**
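For reference, IoU (the Jaccard index) for a class is the number of pixels where prediction and ground truth both equal that class, divided by the number of pixels where at least one of them does. Here is a minimal sketch in plain Python over flattened label lists. The function name `iou_per_class` is illustrative, and skipping pixels labelled 255 assumes the Pascal VOC convention for "void" boundary pixels. How `torchmetrics` treats classes that are absent from both prediction and target may differ from the `None` used here.

```python
def iou_per_class(pred, target, num_classes, ignore_index=255):
    """Per-class IoU over flat lists of predicted and true labels."""
    ious = []
    for c in range(num_classes):
        inter = 0
        union = 0
        for p, t in zip(pred, target):
            if t == ignore_index:       # void pixels do not count at all
                continue
            inter += (p == c) and (t == c)
            union += (p == c) or (t == c)
        # A class absent from both prediction and target has no defined IoU
        ious.append(inter / union if union else None)
    return ious


pred   = [0, 0, 1, 1, 2, 2, 1, 0]
target = [0, 0, 1, 2, 2, 2, 255, 0]

ious = iou_per_class(pred, target, num_classes=3)
present = [v for v in ious if v is not None]
mean_iou = sum(present) / len(present)
print(ious, mean_iou)
```

The mean over classes weights every class equally, so a small class that is segmented poorly lowers the score as much as a large one.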

In [2]:
# ✅ Initialize the task in ClearML
task = Task.init(project_name="VOC-Segmentation", 
                 task_name="Fine-Tuning FCN-ResNet50-VOC",
                 task_type=Task.TaskTypes.training
                )

ClearML Task: created new task id=15137606038b47a68be51eaad6b8dc1d
2025-02-20 09:22:17,134 - clearml.Task - INFO - Storing jupyter notebook directly as code
ClearML results page: https://app.clear.ml/projects/a0f504e0e51b42c7bc9d4531c8c2cf60/experiments/15137606038b47a68be51eaad6b8dc1d/output/log
ClearML Monitor: Could not detect iteration reporting, falling back to iterations as seconds-from-start


In [3]:
# ✅ Hyperparameters (editable in ClearML)
config = {
    # Data parameters
    "dataset": "VOC2012",
    "data_root": "./data",
    "image_size": (520, 520),
    "batch_size": 8,
    "num_workers": 4,
    
    # Model parameters
    "model_name": "fcn_resnet50",
    "pretrained": True,
    "num_classes": 21,  # VOC has 20 classes + background
    
    # Training parameters
    "learning_rate": 0.001,
    "num_epochs": 10,
    "optimizer": "Adam",
    "scheduler": "ReduceLROnPlateau",
    "scheduler_params": {
        "mode": "min",
        "factor": 0.5,
        "patience": 2
    },
    
    # Transforms
    "image_normalization": {
        "mean": [0.485, 0.456, 0.406],
        "std": [0.229, 0.224, 0.225]
    },
    
    # Hardware
    "device": "cuda" if torch.cuda.is_available() else "cpu"
}
config = task.connect(config)
print("Hyperparameters:", config)

Hyperparameters: {'dataset': 'VOC2012', 'data_root': './data', 'image_size': (520, 520), 'batch_size': 8, 'num_workers': 4, 'model_name': 'fcn_resnet50', 'pretrained': True, 'num_classes': 21, 'learning_rate': 0.001, 'num_epochs': 10, 'optimizer': 'Adam', 'scheduler': 'ReduceLROnPlateau', 'scheduler_params': {'mode': 'min', 'factor': 0.5, 'patience': 2}, 'image_normalization': {'mean': [0.485, 0.456, 0.406], 'std': [0.229, 0.224, 0.225]}, 'device': 'cuda'}
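To make the `scheduler_params` above concrete, here is a minimal sketch of how `ReduceLROnPlateau` reacts to a stalled validation loss with `mode="min"`, `factor=0.5` and `patience=2`. The single dummy parameter and the `val_losses` sequence are made up for illustration, and no training takes place.

```python
import torch

# A dummy parameter so that an optimizer can be constructed
param = torch.nn.Parameter(torch.zeros(1))
optimizer = torch.optim.Adam([param], lr=0.001)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=2
)

# Hypothetical validation losses: an improvement, then a plateau, then an improvement
val_losses = [1.0, 0.9, 0.9, 0.9, 0.9, 0.8]
lrs = []
for val_loss in val_losses:
    scheduler.step(val_loss)                      # called once per epoch with the monitored value
    lrs.append(optimizer.param_groups[0]["lr"])

print(lrs)
```

The learning rate is halved only after the loss has failed to improve for more than `patience` consecutive epochs, which is why the `monitor` key of the scheduler config has to point at a metric that is logged every epoch.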


In [4]:
# ✅ Data loading function (Pascal VOC 2012)
def get_dataloaders(batch_size):
    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize(
            mean=config["image_normalization"]["mean"], 
            std=config["image_normalization"]["std"]
        ),
        transforms.Resize(config['image_size'], antialias=True)
    ])

    target_transform = transforms.Compose([
        transforms.PILToTensor(),
        transforms.Resize(config['image_size'], interpolation=transforms.InterpolationMode.NEAREST)
    ])
    
    train_dataset = datasets.VOCSegmentation(root=config["data_root"], year="2012", image_set="train", 
                                             download=True, transform=transform, target_transform=target_transform)
    val_dataset = datasets.VOCSegmentation(root=config["data_root"], year="2012", image_set="val", 
                                           download=True, transform=transform, target_transform=target_transform)

    # Log dataset information
    task.get_logger().report_text(
        f"Train dataset size: {len(train_dataset)} samples",
        level=logging.INFO
    )
    task.get_logger().report_text(
        f"Validation dataset size: {len(val_dataset)} samples", 
        level=logging.INFO
    )

    # VOC class names
    voc_class_names = [
        'background', 'aeroplane', 'bicycle', 'bird', 'boat', 'bottle', 
        'bus', 'car', 'cat', 'chair', 'cow', 'diningtable', 'dog', 
        'horse', 'motorbike', 'person', 'pottedplant', 'sheep', 'sofa', 
        'train', 'tvmonitor']
    # Log the class names
    task.connect({"class_mapping": {i: name for i, name in enumerate(voc_class_names)}})


    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, num_workers=config["num_workers"])
    val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False, num_workers=config["num_workers"])

    return train_loader, val_loader

In [5]:
# Log the transforms
transforms_config = {
    "image_transforms": [
        "ToTensor()",
        f"Normalize(mean={config['image_normalization']['mean']}, std={config['image_normalization']['std']})",
        f"Resize(size={config['image_size']}, antialias=True)"
    ],
    "target_transforms": [
        "PILToTensor()",
        f"Resize(size={config['image_size']}, interpolation=InterpolationMode.NEAREST)"
    ]
}

# Log configuration to ClearML
task.connect_configuration({"transforms_config": transforms_config})
print("Transforms:", transforms_config)

Transforms: {'image_transforms': ['ToTensor()', 'Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])', 'Resize(size=(520, 520), antialias=True)'], 'target_transforms': ['PILToTensor()', 'Resize(size=(520, 520), interpolation=InterpolationMode.NEAREST)']}
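The target transform uses `InterpolationMode.NEAREST` because mask pixels are class indices, not intensities: averaging two indices produces a third, unrelated class. Below is a minimal one-dimensional sketch in plain Python. Both resize helpers are simplified illustrations and do not reproduce the exact sampling used by `torchvision`.

```python
def resize_nearest(row, new_len):
    """Nearest-neighbour resize: every output value is copied from the input."""
    scale = len(row) / new_len
    return [row[min(int(i * scale), len(row) - 1)] for i in range(new_len)]


def resize_linear(row, new_len):
    """Linear interpolation: output values are blends of neighbouring inputs."""
    out = []
    for i in range(new_len):
        pos = i * (len(row) - 1) / (new_len - 1)
        lo = int(pos)
        hi = min(lo + 1, len(row) - 1)
        frac = pos - lo
        out.append(round(row[lo] * (1 - frac) + row[hi] * frac))
    return out


mask_row = [0, 0, 15, 15]   # background (0) next to "person" (15)
nearest = resize_nearest(mask_row, 7)
linear = resize_linear(mask_row, 7)

invented_labels = set(linear) - set(mask_row)
print(nearest, linear, invented_labels)
```

Nearest-neighbour resizing keeps the label set intact, while linear interpolation invents a label at the object boundary that corresponds to a class not present in the image.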


In [6]:
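# ✅ Segmentation model with a pretrained FCN-ResNet50
import numpy as np
import matplotlib.pyplot as plt
import torchmetrics
from torchvision.models.segmentation import fcn_resnet50, FCN_ResNet50_Weights
from PIL import Image

# VOC class names at module level: the list inside get_dataloaders() is local to that function
voc_class_names = [
    'background', 'aeroplane', 'bicycle', 'bird', 'boat', 'bottle', 'bus', 'car', 'cat', 'chair', 'cow',
    'diningtable', 'dog', 'horse', 'motorbike', 'person', 'pottedplant', 'sheep', 'sofa', 'train', 'tvmonitor']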

class SegmentationModel(LightningModule):
    def __init__(self, config):
        super().__init__()
        self.config = config
        self.num_classes = config["num_classes"]
        self.lr = config["learning_rate"]
        
        # Log model architecture configuration
        self.model_config = {
            "backbone": "resnet50",
            "head": "FCN",
            "pretrained": config["pretrained"],
            "num_classes": self.num_classes,
        }
        
        # Load pretrained FCN-ResNet50 model
        if config["pretrained"]:
            self.model = fcn_resnet50(weights=FCN_ResNet50_Weights.DEFAULT)
        else:
            self.model = fcn_resnet50(weights=None)
            
        self.model.classifier[4] = nn.Conv2d(512, self.num_classes, kernel_size=(1, 1), stride=(1, 1))
        
        # Metrics (label 255 marks "void" boundary pixels in VOC masks and is excluded)
        self.train_iou = torchmetrics.JaccardIndex(task="multiclass", num_classes=self.num_classes, ignore_index=255)
        self.val_iou = torchmetrics.JaccardIndex(task="multiclass", num_classes=self.num_classes, ignore_index=255)
        
        # Class IoU metrics for detailed analysis
        self.val_class_iou = torchmetrics.JaccardIndex(
            task="multiclass", 
            num_classes=self.num_classes,
            average=None,
            ignore_index=255
        )
        
        # ClearML logger
        self.clearml_logger = task.get_logger()
        
        # Log model architecture
        self.clearml_logger.report_text(
            f"Model Architecture: {self.model_config}",
            level=logging.INFO
        )
    
    def forward(self, x):
        return self.model(x)['out']
    
    def training_step(self, batch, batch_idx):
        images, masks = batch
        masks = masks.squeeze(1).long()  # Convert [B, 1, H, W] to [B, H, W]
        
        outputs = self(images)
        loss = F.cross_entropy(outputs, masks, ignore_index=255)  # skip VOC "void" pixels
        
        # Calculate IoU
        preds = torch.argmax(outputs, dim=1)
        iou = self.train_iou(preds, masks)
        
        # Log metrics with ClearML
        self.clearml_logger.report_scalar("Train/Loss", "CE", loss.item(), self.global_step)
        self.clearml_logger.report_scalar("Train/IoU", "Mean", iou.item(), self.global_step)
        
        return loss
    
    def on_validation_epoch_start(self):
        # Lightning 2.x no longer passes step outputs to epoch-end hooks, so collect losses manually
        self._val_losses = []

    def validation_step(self, batch, batch_idx):
        images, masks = batch
        masks = masks.squeeze(1).long()
        
        outputs = self(images)
        loss = F.cross_entropy(outputs, masks, ignore_index=255)  # skip VOC "void" pixels
        self._val_losses.append(loss.detach())
        
        # Accumulate mean and per-class IoU over the epoch
        preds = torch.argmax(outputs, dim=1)
        self.val_iou.update(preds, masks)
        self.val_class_iou.update(preds, masks)
        
        # Log validation images for the first batch of each epoch
        if batch_idx == 0:
            self._log_images(images, masks, preds)
            
        return loss
    
    def on_validation_epoch_end(self):
        avg_loss = torch.stack(self._val_losses).mean()
        avg_iou = self.val_iou.compute()
        
        # Get per-class IoU and log it
        class_ious = self.val_class_iou.compute()
        for i, class_iou in enumerate(class_ious):
            self.clearml_logger.report_scalar(
                "Val/ClassIoU", 
                voc_class_names[i], 
                class_iou.item(), 
                self.current_epoch
            )
        
        # Reset metrics for next epoch
        self.val_iou.reset()
        self.val_class_iou.reset()
        
        # Log with ClearML
        self.clearml_logger.report_scalar("Val/Loss", "CE", avg_loss.item(), self.current_epoch)
        self.clearml_logger.report_scalar("Val/IoU", "Mean", avg_iou.item(), self.current_epoch)
        
        # Log through Lightning so ModelCheckpoint and the LR scheduler can monitor these values
        self.log("val_loss", avg_loss, prog_bar=True)
        self.log("val_iou", avg_iou, prog_bar=True)
    
    def _log_images(self, images, masks, predictions, max_imgs=4):
        """Log images with ClearML for visualization"""
        n_imgs = min(max_imgs, images.shape[0])
        
        # Create a colormap for segmentation masks
        def colorize_mask(mask):
            # Simple colormap (could use a more sophisticated one)
            cmap = plt.get_cmap('tab20', self.num_classes)
            colored = cmap(mask.cpu().numpy())  # float RGBA values in [0, 1]
            return Image.fromarray((colored[:, :, :3] * 255).astype(np.uint8))
        
        for idx in range(n_imgs):
            # Original image: undo the normalization and convert to uint8 for PIL
            img = images[idx].cpu().permute(1, 2, 0).numpy()
            img = np.clip((img * np.array(self.config["image_normalization"]["std"]) + 
                          np.array(self.config["image_normalization"]["mean"])), 0, 1)
            img = Image.fromarray((img * 255).astype(np.uint8))
            
            # Ground truth mask
            gt_mask = colorize_mask(masks[idx])
            
            # Predicted mask
            pred_mask = colorize_mask(predictions[idx])
            
            # Log to ClearML
            self.clearml_logger.report_image(title="Validation Images", series="Input Image", iteration=self.global_step, image=img)
            self.clearml_logger.report_image(title="Validation Images", series="GT Mask", iteration=self.global_step, image=gt_mask)
            self.clearml_logger.report_image(title="Validation Images", series="Predicted Mask", iteration=self.global_step, image=pred_mask)

    
    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=self.lr)
        
        if self.config["scheduler"] == "ReduceLROnPlateau":
            scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
                optimizer, 
                mode=self.config["scheduler_params"]["mode"],
                factor=self.config["scheduler_params"]["factor"], 
                patience=self.config["scheduler_params"]["patience"]
            )
            return {
                "optimizer": optimizer,
                "lr_scheduler": {
                    "scheduler": scheduler,
                    "monitor": "val_loss",
                    "interval": "epoch",
                    "frequency": 1
                }
            }
        else:
            return optimizer

<div class="alert alert-info">
    
**🔹 How does it work?**
1. `FCN-ResNet50` from `torchvision`:

* We take a pretrained `FCN-ResNet50`.
* We can also train from scratch (`"pretrained": False` in the config).
* The output layer is replaced to produce `num_classes` (21) channels.
  
2. `JaccardIndex` from `torchmetrics`:

* We use `JaccardIndex` to compute `IoU`, the standard metric for segmentation.
* The metric state is accumulated at every step, and the value for the whole epoch is computed at the end.
  
3. Logging to `ClearML`:

* network hyperparameters
* `train_loss`, `val_loss`
* `train_iou`, `val_iou`
* Images: the input, the Ground Truth mask, and the predicted mask.
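The reason for accumulating the metric state over the epoch, rather than averaging per-batch IoU values, can be shown on a toy case. This is a minimal sketch in plain Python for a single foreground class. The `BinaryIoUAccumulator` class is a hypothetical illustration of the `update` / `compute` / `reset` pattern and is not the `torchmetrics` implementation.

```python
class BinaryIoUAccumulator:
    """Accumulates intersection and union counts across batches for one class."""

    def __init__(self):
        self.reset()

    def reset(self):
        self.inter = 0
        self.union = 0

    def update(self, pred, target):
        self.inter += sum(p and t for p, t in zip(pred, target))
        self.union += sum(p or t for p, t in zip(pred, target))

    def compute(self):
        return self.inter / self.union if self.union else 0.0


batches = [
    ([1, 1, 1, 1], [1, 1, 1, 1]),   # large object, segmented perfectly
    ([1, 0, 0, 0], [0, 1, 0, 0]),   # tiny object, missed completely
]

epoch_metric = BinaryIoUAccumulator()
per_batch_ious = []
for pred, target in batches:
    epoch_metric.update(pred, target)
    batch_metric = BinaryIoUAccumulator()
    batch_metric.update(pred, target)
    per_batch_ious.append(batch_metric.compute())

epoch_iou = epoch_metric.compute()
mean_of_batches = sum(per_batch_ious) / len(per_batch_ious)
print(epoch_iou, mean_of_batches)
```

The two numbers differ because the epoch-level value weights every pixel equally, while the mean of per-batch values weights every batch equally regardless of how many foreground pixels it contains.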


In [7]:
# ✅ Initialize the data and the model
train_loader, val_loader = get_dataloaders(config["batch_size"])
model = SegmentationModel(config)

# ✅ Callbacks (saving the best model, LR monitoring)
callbacks = [
    ModelCheckpoint(monitor='val_iou', 
                    filename='fcn-resnet50-voc-{epoch:02d}-{val_iou:.3f}',
                    save_top_k=2,
                    mode='max',
                    save_last=True
                   ),
    LearningRateMonitor(logging_interval="step"),
]

Using downloaded and verified file: ./data/VOCtrainval_11-May-2012.tar
Extracting ./data/VOCtrainval_11-May-2012.tar to ./data
Using downloaded and verified file: ./data/VOCtrainval_11-May-2012.tar
Extracting ./data/VOCtrainval_11-May-2012.tar to ./data


Unsupported key of type '<class 'int'>' found when connecting dictionary. It will be converted to str


Train dataset size: 1464 samples
Validation dataset size: 1449 samples
2025-02-20 09:26:48,384 - clearml.model - INFO - Selected model id: f35a6f20500b43eebca2572247ae0168
Model Architecture: {'backbone': 'resnet50', 'head': 'FCN', 'pretrained': True, 'num_classes': 21}


In [19]:
#os.environ["CUDA_VISIBLE_DEVICES"] = "0"

In [8]:
# ✅ Training setup
seed_everything(42)
trainer = Trainer(
    max_epochs=config["num_epochs"],
    accelerator="gpu" if torch.cuda.is_available() else "cpu",
    #devices="auto",
    callbacks=callbacks,
    log_every_n_steps=50,
)

Seed set to 42



In [None]:
# ✅ Start training
trainer.fit(model, train_loader, val_loader)

In [10]:
# ✅ Don't forget to close the task
task.close()

### A look at the validation samples (the `debug` tab)

<img src='../images/seg.png'>

## <center id="p6"> 🧸 Conclusions ✅ </center>

<div class="alert alert-success">

In this lesson we looked at how the `Lightning + ClearML` combination helps speed up development and debugging in CV tasks:    
* Covered logging for an image classification task
* Fine-tuned a pretrained backbone
* Saw how to organize the code for a segmentation task
* In the homework we will practice training a GAN.