# <center> <h1> 🧶 Связка `Lightning` + `ClearML` в задачах NLP. 🔤</h1> </center>

### Оглавление ноутбука
<img src='../images/nlp.webp' align="right" width="508" height="428" >
<br>

<p><font size="3" face="Arial" font-size="large"><ul type="square">
    
<li><a href="#p1">🧐 Посмотрим на связку в деле!</a></li>
<li><a href="#p2">☝️ 1-й способ: логируем через консоль</a></li>
<li><a href="#p7">✌️ 2-й способ: логирование через TensorBoardLogger</a></li>
<li><a href="#p5">🎚 Finetuning трансформера под свою задачу </a></li>
<li><a href="#p6">🧸 Выводы и заключения ✅ </a></li>


    
</ul></font></p>

## <center> 🧑‍🎓 Разберем связку **PyTorch Lightning** + **ClearML**


<div class="alert alert-info">

**ClearML** легко интегрируется с **PyTorch Lightning**, автоматически логируя модели PyTorch, параметры, и многое другое. Эта связка значимо упрощает работу в задачах с текстами. 

Все, что вам нужно сделать, это просто добавить две строки кода в ваш скрипт **PyTorch Lightning**:

<div class="alert alert-success">
    
```python
from clearml import Task

task = Task.init(task_name="<task_name>", project_name="<project_name>")
```

<div class="alert alert-info"> 

🤯 Вот и всё! Это создает эксперимент в **ClearML**, который фиксирует:

* Исходный код и несохраненные изменения
* Установленные пакеты
* Модели PyTorch
* Параметры, предоставляемые `LightningCLI`
* Всё, что мы отправляем в TensorBoard
* Весь выход консоли
* Общие сведения, такие как сведения о машине, время выполнения, дата создания и т. д.
* И многое другое

# <center id="p1">  🧐 Посмотрим на связку в деле!</center>

In [None]:
!pip install clearml tensorboard datasets -q

In [1]:
import os
import pandas as pd
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from dataclasses import asdict, dataclass

from lightning import LightningDataModule, LightningModule, Trainer
from lightning.pytorch.loggers import TensorBoardLogger

from clearml import Task, Logger

Вводим ключи ClearML со страницы https://app.clear.ml/settings/workspace-configuration

In [44]:
from getpass import getpass
# Введите поочерёдно полученные ключи в появившемся окне (код изменять не нужно)
access_key = getpass(prompt="Введите API Access токен: ")
secret_key = getpass(prompt="Введите API Secret токен: ")

Введите API Access токен:  ········
Введите API Secret токен:  ········


In [56]:
%%capture
#  Не показывать свои api-ключи
%env CLEARML_WEB_HOST=https://app.clear.ml/
%env CLEARML_API_HOST=https://api.clear.ml
%env CLEARML_FILES_HOST=https://files.clear.ml

%env CLEARML_API_ACCESS_KEY=$access_key
%env CLEARML_API_SECRET_KEY=$secret_key

<div class="alert alert-info"> 
    
Для примера возьмём датасет `AG NEWS`, в котором содержатся новостные заметки по различным тематикам. И натренируем нейросеть определять тематику новости.

In [2]:
url = 'https://raw.githubusercontent.com/mhjabreel/CharCnn_Keras/master/data/ag_news_csv/train.csv'
news = pd.read_csv(url, names=['label', 'title', 'text'])
news.head()

Unnamed: 0,label,title,text
0,3,Wall St. Bears Claw Back Into the Black (Reuters),"Reuters - Short-sellers, Wall Street's dwindli..."
1,3,Carlyle Looks Toward Commercial Aerospace (Reu...,Reuters - Private investment firm Carlyle Grou...
2,3,Oil and Economy Cloud Stocks' Outlook (Reuters),Reuters - Soaring crude prices plus worries\ab...
3,3,Iraq Halts Oil Exports from Main Southern Pipe...,Reuters - Authorities have halted oil export\f...
4,3,"Oil prices soar to all-time record, posing new...","AFP - Tearaway world oil prices, toppling reco..."


In [4]:
news.label.value_counts()

label
3    30000
4    30000
2    30000
1    30000
Name: count, dtype: int64

In [5]:
classes = ("World", "Sports", "Sci/Tec", "Business")

## <center> Создаём Dataset и Datamodule </center>

In [11]:
class TextDataset(Dataset):
    def __init__(self, csv_file, vocab=None, tokenizer=None):
        self.data = pd.read_csv(csv_file, names=['label', 'title', 'text'])
        self.tokenizer = tokenizer or (lambda x: x.split())
        self.vocab = vocab or self.build_vocab()

    def build_vocab(self):
        '''
        Создает словарь токенов с уникальными идентификаторами, начиная с "<unk>" (0) для неизвестных токенов. 
        Метод проходит по текстам в self.data['text'], токенизирует их и добавляет новые токены с индексами, 
        равными текущему размеру словаря.

        Возвращает:
            dict: Словарь токенов с их идентификаторами.
        ''' 
    
        vocab = {"<unk>": 0}
        for text in self.data['text']:
            for token in self.tokenizer(text):
                if token not in vocab:
                    vocab[token] = len(vocab)
        return vocab

    def encode_text(self, text):
        return torch.tensor([self.vocab.get(token, self.vocab["<unk>"]) for token in self.tokenizer(text)], dtype=torch.long)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        label = self.data.iloc[idx]['label'] - 1
        text = self.data.iloc[idx]['text']
        encoded_text = self.encode_text(text)
        return label, encoded_text, text

In [12]:
class TextDataModule(LightningDataModule):
    def __init__(self, train_csv, test_csv, batch_size=16, tokenizer=None):
        super().__init__()
        self.train_csv = train_csv
        self.test_csv = test_csv
        self.batch_size = batch_size
        self.tokenizer = tokenizer or (lambda x: x.split())

    def setup(self, stage=None):
        self.train_dataset = TextDataset(self.train_csv, tokenizer=self.tokenizer)
        self.test_dataset = TextDataset(self.test_csv, vocab=self.train_dataset.vocab, tokenizer=self.tokenizer)

        self.vocab = self.train_dataset.vocab
        self.num_classes = len(set(self.train_dataset.data['label']))

    def collate_fn(self, batch):
        labels, texts, origs = zip(*batch)
        offsets = torch.tensor([0] + [len(text) for text in texts[:-1]]).cumsum(dim=0)
        texts = torch.cat(texts)
        labels = torch.tensor(labels, dtype=torch.long)
        return texts, offsets, labels, origs

    def train_dataloader(self):
        return DataLoader(self.train_dataset, batch_size=self.batch_size, shuffle=True, collate_fn=self.collate_fn, num_workers=4)

    def val_dataloader(self):
        return DataLoader(self.test_dataset, batch_size=self.batch_size, collate_fn=self.collate_fn, num_workers=4)


# <center id="p2"> ☝️ 1-й способ: логируем через консоль

<div class="alert alert-info"> 


Удобно смотреть на промежуточные итоги обучения модели - для мониторинга на каких примерах модель сильнее ошибается.

<div class="alert alert-info"> 
    
**Для этого:** В валидационный стэп добавляем промежуточное логирование текстов и предсказаний, чтобы они отображались в `ClearML` - можно в реальном времени следить как модель "умнеет", глядя на её предсказания.


Это можно сделать двумя способами:
* через консоль, используя `logger ClearML`
* через `debug samples`, используя `TensorBoardLogger`

Рассмотрим оба варианта!

In [13]:
# Пишем класс для модели в LightningModule
class TextSentimentModel(LightningModule):
    def __init__(self, vocab_size, embed_dim, num_class, learning_rate=1.0, logger=None, batch_size=48):
        super().__init__()
        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim, sparse=True)
        self.fc = nn.Linear(embed_dim, num_class)
        self.criterion = nn.CrossEntropyLoss()
        self.learning_rate = learning_rate
        self.loggs = logger # добавляем внешний логгер
        self.batch_size = batch_size
        self.init_weights()

    def init_weights(self):
        initrange = 0.5
        self.embedding.weight.data.uniform_(-initrange, initrange)
        self.fc.weight.data.uniform_(-initrange, initrange)
        self.fc.bias.data.zero_()

    def forward(self, text, offsets):
        embedded = self.embedding(text, offsets)
        return self.fc(embedded)

    def training_step(self, batch, batch_idx):
        text, offsets, labels, origs = batch
        outputs = self.forward(text, offsets)
        loss = self.criterion(outputs, labels)
        self.log("train_loss", loss, batch_size=self.batch_size)
        return loss

    def validation_step(self, batch, batch_idx):
        text, offsets, labels, origs = batch
        outputs = self.forward(text, offsets)
        loss = self.criterion(outputs, labels)
        acc = (outputs.argmax(1) == labels).float().mean()

        # Логируем validation loss and accuracy в прогресс бар
        self.log("val_loss", loss, prog_bar=True, batch_size=self.batch_size)
        self.log("val_acc", acc, prog_bar=True, batch_size=self.batch_size)

        # Логируем тестовые сэмплы
        if logger and batch_idx == 0:  # Логируем только для первого батча
            predictions = outputs.argmax(1)
            print(f"Val_predictions for epoch {self.current_epoch}:")
            for i in range(min(5, len(labels))):
                    self.loggs.report_text(
                    f'''Text: {origs[i]}; 
                    Prediction: {classes[predictions[i].item() - 1]}; 
                    True Label: {classes[labels[i].item() - 1]}'''
                )

    def configure_optimizers(self):
        optimizer = torch.optim.SGD(self.parameters(), lr=self.learning_rate)
        scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=2, gamma=0.9)
        return [optimizer], [scheduler]

In [14]:
# Создадим конфиг
@dataclass
class CFG:
    project_name: str = "TextClassification"
    task_name: str = "AG_NEWS Text Classification"
    train_csv: str = "https://raw.githubusercontent.com/mhjabreel/CharCnn_Keras/master/data/ag_news_csv/train.csv"
    test_csv: str = "https://raw.githubusercontent.com/mhjabreel/CharCnn_Keras/master/data/ag_news_csv/test.csv"
    batch_size: int = 48
    learning_rate: float = 0.01
    seed: int = 2024
    device: str = 'cpu'  #"cuda"
    embed_dim: int = 32
    epochs: int = 3

# Чтобы сохранить конфигурацию текущего эксперимента, перенесём её в словарь
cfg = CFG()
configuration_dict = asdict(cfg)
configuration_dict

{'project_name': 'TextClassification',
 'task_name': 'AG_NEWS Text Classification',
 'train_csv': 'https://raw.githubusercontent.com/mhjabreel/CharCnn_Keras/master/data/ag_news_csv/train.csv',
 'test_csv': 'https://raw.githubusercontent.com/mhjabreel/CharCnn_Keras/master/data/ag_news_csv/test.csv',
 'batch_size': 48,
 'learning_rate': 0.01,
 'seed': 2024,
 'device': 'cpu',
 'embed_dim': 32,
 'epochs': 3}

In [15]:
# Initialize ClearML task
task = Task.init(project_name=cfg.project_name, task_name=cfg.task_name)

ClearML Task: created new task id=488c8c3a113c4126bec2618c53ebc2bb
2025-02-12 21:53:14,851 - clearml.Task - INFO - Storing jupyter notebook directly as code
ClearML results page: https://app.clear.ml/projects/651ec2987cb34c0c9613bda5583f55e9/experiments/488c8c3a113c4126bec2618c53ebc2bb/output/log


In [16]:
logger = Logger.current_logger()

In [18]:
# Инициализируем DataModule
data_module = TextDataModule(cfg.train_csv, cfg.test_csv, batch_size=cfg.batch_size)
data_module.setup()

In [20]:
# Логируем конфиг в ClearML
cfg.vocab_size = len(data_module.vocab)
cfg.num_class = data_module.num_classes
configuration_dict = task.connect(asdict(cfg))
print(configuration_dict)  # printing actual configuration (after override in remote mode)

{'project_name': 'TextClassification', 'task_name': 'AG_NEWS Text Classification', 'train_csv': 'https://raw.githubusercontent.com/mhjabreel/CharCnn_Keras/master/data/ag_news_csv/train.csv', 'test_csv': 'https://raw.githubusercontent.com/mhjabreel/CharCnn_Keras/master/data/ag_news_csv/test.csv', 'batch_size': 48, 'learning_rate': 0.01, 'seed': 2024, 'device': 'cpu', 'embed_dim': 32, 'epochs': 3}


In [21]:
# Инициализируем модель
model = TextSentimentModel(cfg.vocab_size, cfg.embed_dim, cfg.num_class, cfg.learning_rate, logger=logger)

In [23]:
trainer = Trainer(
    max_epochs=cfg.epochs,
    accelerator=cfg.device,
)

GPU available: True (cuda), used: False
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs


<div class="alert alert-info"> 
    
Запускаем тренировку, видим, что каждую эпоху в консоль пишутся результаты инференса модели на валидационных сэмплах.

In [24]:
# Training
trainer.fit(model, datamodule=data_module)


  | Name      | Type             | Params | Mode 
-------------------------------------------------------
0 | embedding | EmbeddingBag     | 5.0 M  | train
1 | fc        | Linear           | 132    | train
2 | criterion | CrossEntropyLoss | 0      | train
-------------------------------------------------------
5.0 M     Trainable params
0         Non-trainable params
5.0 M     Total params
19.974    Total estimated model params size (MB)
3         Modules in train mode
0         Modules in eval mode


Sanity Checking: |          | 0/? [00:00<?, ?it/s]

Training: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Val_predictions for epoch 0:
Text: ----- NASHVILLE, Tennessee (Ticker) - The Nashville Predators signed defenseman Ryan Suter, their first-round pick in the 2003 draft, on Thursday.; 
                    Prediction: Sports; 
                    True Label: World
Text: Strong international sales growth and solid U.S. comps propel the company's stock to its highest price ever.; 
                    Prediction: Sci/Tec; 
                    True Label: Sports
Text: The Russian president puts some blame on his international critics -- and supports president Bush; 
                    Prediction: Business; 
                    True Label: Business
Text: The Net needs a new layer of abilities that will deal with imminent problems of capacity, security and reliability, Intel's CTO says.; 
                    Prediction: Sports; 
                    True Label: Sci/Tec
Text: BOSTON - Computer viruses and worms will have to share the stage with a new challenger for the attention of attendees at

Validation: |          | 0/? [00:00<?, ?it/s]

Val_predictions for epoch 1:
Text: ----- NASHVILLE, Tennessee (Ticker) - The Nashville Predators signed defenseman Ryan Suter, their first-round pick in the 2003 draft, on Thursday.; 
                    Prediction: Sports; 
                    True Label: World
Text: Strong international sales growth and solid U.S. comps propel the company's stock to its highest price ever.; 
                    Prediction: Sci/Tec; 
                    True Label: Sports
Text: The Russian president puts some blame on his international critics -- and supports president Bush; 
                    Prediction: Business; 
                    True Label: Business
Text: The Net needs a new layer of abilities that will deal with imminent problems of capacity, security and reliability, Intel's CTO says.; 
                    Prediction: Sports; 
                    True Label: Sci/Tec
Text: BOSTON - Computer viruses and worms will have to share the stage with a new challenger for the attention of attendees at

Validation: |          | 0/? [00:00<?, ?it/s]

Val_predictions for epoch 2:
Text: ----- NASHVILLE, Tennessee (Ticker) - The Nashville Predators signed defenseman Ryan Suter, their first-round pick in the 2003 draft, on Thursday.; 
                    Prediction: Business; 
                    True Label: World
Text: Strong international sales growth and solid U.S. comps propel the company's stock to its highest price ever.; 
                    Prediction: Sci/Tec; 
                    True Label: Sports
Text: The Russian president puts some blame on his international critics -- and supports president Bush; 
                    Prediction: Business; 
                    True Label: Business
Text: The Net needs a new layer of abilities that will deal with imminent problems of capacity, security and reliability, Intel's CTO says.; 
                    Prediction: Sports; 
                    True Label: Sci/Tec
Text: BOSTON - Computer viruses and worms will have to share the stage with a new challenger for the attention of attendees 

`Trainer.fit` stopped: `max_epochs=3` reached.


<div class="alert alert-info"> 

Проверим как модель предсказывает на случайном примере из теста!

In [25]:
def predict(text, model, vocab, tokenizer):
    model.eval()
    with torch.no_grad():
        tokens = torch.tensor([vocab.get(token, vocab["<unk>"]) for token in tokenizer(text)], dtype=torch.long)
        offsets = torch.tensor([0])
        output = model(tokens, offsets)
        prediction = output.argmax(1).item()
        return prediction

In [26]:
# Load a random example from the test dataset
random_idx = torch.randint(0, len(data_module.test_dataset), (1,)).item()
example_label, example_text, orig = data_module.test_dataset[random_idx]
predicted_label = predict(orig, model, data_module.vocab, data_module.tokenizer)

print(f"Text: {orig}")
print(f"True Label: {example_label}")
print(f"Predicted Label: {predicted_label}")

Text: Javy Lopez drives in four runs, Daniel Cabrera becomes the first rookie to win 10 games this season, and the Orioles hold Tampa Bay to two hits in an 8-0 victory Wednesday night.
True Label: 1
Predicted Label: 1


In [27]:
# Не забываем завершить эксперимент
task.close()


    
## <center> Переходим в UI от ClearML

Cмотрим как залогировались наши сэмплы - отслеживаем прогресс обучения.

<div class="alert alert-info"> 

<img src='../images/clnlp.png'> 

## <center id="p7"> ✌️ 2-й способ: логирование через `TensorBoardLogger`

<div class="alert alert-info"> 
    
* Более лаконичный код
* Меньше мусора в консоли
* Легче найти валидационные сэмплы на отдельной вкладке
* Не нужно разгребать всю консоль и искать результаты инференса среди других сообщений в консоли.

In [28]:
class TextSentimentModel(LightningModule):
    def __init__(self, vocab_size, embed_dim, num_class, learning_rate=1.0, batch_size=48):
        super().__init__()
        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim, sparse=True)
        self.fc = nn.Linear(embed_dim, num_class)
        self.criterion = nn.CrossEntropyLoss()
        self.learning_rate = learning_rate
        self.batch_size = batch_size
        self.init_weights()

    def init_weights(self):
        initrange = 0.5
        self.embedding.weight.data.uniform_(-initrange, initrange)
        self.fc.weight.data.uniform_(-initrange, initrange)
        self.fc.bias.data.zero_()

    def forward(self, text, offsets):
        embedded = self.embedding(text, offsets)
        return self.fc(embedded)

    def training_step(self, batch, batch_idx):
        text, offsets, labels, origs = batch
        outputs = self.forward(text, offsets)
        loss = self.criterion(outputs, labels)
        self.log("train_loss", loss, batch_size=self.batch_size)
        return loss

    def validation_step(self, batch, batch_idx):
        text, offsets, labels, origs = batch
        outputs = self.forward(text, offsets)
        loss = self.criterion(outputs, labels)
        acc = (outputs.argmax(1) == labels).float().mean()

        # Логируем validation loss and accuracy в прогресс бар
        self.log("val_loss", loss, prog_bar=True, batch_size=self.batch_size)
        self.log("val_acc", acc, prog_bar=True, batch_size=self.batch_size)

        # # Логируем тестовые сэмплы
        if batch_idx == 0:  # Логируем только для первого батча
            predictions = outputs.argmax(1)
            for i in range(min(5, len(labels))):
                    self.logger.experiment.add_text(
                "val_predictions",
                f'''Text: {origs[i]}; 
                Prediction: {classes[predictions[i].item() - 1]}; 
                True Label: {classes[labels[i].item() - 1]}''',
                self.global_step
            )


    def configure_optimizers(self):
        optimizer = torch.optim.SGD(self.parameters(), lr=self.learning_rate)
        scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=2, gamma=0.9)
        return [optimizer], [scheduler]

In [29]:
# Initialize ClearML task
task = Task.init(project_name=cfg.project_name, 
                 task_name=cfg.task_name, 
                 auto_connect_frameworks={'tensorboard': True}) # Добавляем для отображения сообщений TensorBoard в ClearML

ClearML Task: created new task id=c9b914e959bb4a398592e2f0bea89352
ClearML results page: https://app.clear.ml/projects/651ec2987cb34c0c9613bda5583f55e9/experiments/c9b914e959bb4a398592e2f0bea89352/output/log


In [30]:
model = TextSentimentModel(cfg.vocab_size, cfg.embed_dim, cfg.num_class, cfg.learning_rate)

In [31]:
# Логгер и тренер
logger = TensorBoardLogger("./lightning_logs", name="text_classification")
trainer = Trainer(
    max_epochs=cfg.epochs,
    logger=[logger],
    accelerator=cfg.device,
)

GPU available: True (cuda), used: False
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs

GPU available but not used. You can set it by doing `Trainer(accelerator='gpu')`.



In [32]:
trainer.fit(model, datamodule=data_module)


  | Name      | Type             | Params | Mode 
-------------------------------------------------------
0 | embedding | EmbeddingBag     | 5.0 M  | train
1 | fc        | Linear           | 132    | train
2 | criterion | CrossEntropyLoss | 0      | train
-------------------------------------------------------
5.0 M     Trainable params
0         Non-trainable params
5.0 M     Total params
19.974    Total estimated model params size (MB)
3         Modules in train mode
0         Modules in eval mode


Sanity Checking: |          | 0/? [00:00<?, ?it/s]

Training: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

`Trainer.fit` stopped: `max_epochs=3` reached.


In [33]:
# Не забываем завершить эксперимент
task.close()

<div class="alert alert-success">
    
Посмотрим на вкладку `debug samples`

<img src='../images/debnlp.png'>

<div class="alert alert-success">
    
Посмотрим как выглядит отдельный сэмпл
(откроется в отдельном окне, если кликнуть по нему)

<img src='../images/smpnlp.png'>

# <center id="p5"> 🎚 Finetuning трансформера под свою задачу </center>

<div class="alert alert-info"> 
    
Рассмотрим ещё один популярный юзкейс - дообучение (файнтюнинг) трансформера под свою задачу в связке `Lightning + ClearML`!

Возьмём датасет `IMBD` с отзывами о фильмах и дообучим `DistilBert` под задачу определения характера отзыва.

In [2]:
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification, AdamW
from datasets import load_dataset

In [3]:
class IMDBDataModule(LightningDataModule):
    def __init__(self, model_name="distilbert-base-uncased", batch_size=16):
        super().__init__()
        self.tokenizer = DistilBertTokenizer.from_pretrained(model_name)
        self.batch_size = batch_size

    def prepare_data(self):
        # Загружаем датасет IMDB
        load_dataset("imdb")

    def setup(self, stage=None):
        dataset = load_dataset("imdb")
        self.train_data = dataset["train"]
        self.test_data = dataset["test"]

        def tokenize_function(examples):
            return self.tokenizer(examples["text"], padding="max_length", truncation=True, max_length=512)

        self.train_data = self.train_data.map(tokenize_function, batched=True)
        self.test_data = self.test_data.map(tokenize_function, batched=True)

        self.train_data.set_format(type="torch", columns=["input_ids", "attention_mask", "label", "text"])
        self.test_data.set_format(type="torch", columns=["input_ids", "attention_mask", "label", "text"])

    def train_dataloader(self):
        return DataLoader(self.train_data, batch_size=self.batch_size, shuffle=True, num_workers=4)

    def val_dataloader(self):
        return DataLoader(self.test_data, batch_size=self.batch_size, num_workers=4)


In [4]:
class DistilBertClassifier(LightningModule):
    def __init__(self, model_name="distilbert-base-uncased", learning_rate=2e-5):
        super().__init__()
        self.save_hyperparameters()
        self.model = DistilBertForSequenceClassification.from_pretrained(model_name, num_labels=2)
        self.loss_fn = nn.CrossEntropyLoss()

    def forward(self, input_ids, attention_mask):
        return self.model(input_ids, attention_mask=attention_mask).logits

    def training_step(self, batch, batch_idx):
        outputs = self(batch["input_ids"], batch["attention_mask"])
        loss = self.loss_fn(outputs, batch["label"])
        self.log("train_loss", loss)
        return loss

    def validation_step(self, batch, batch_idx):
        outputs = self(batch["input_ids"], batch["attention_mask"])
        loss = self.loss_fn(outputs, batch["label"])
        acc = (outputs.argmax(1) == batch["label"]).float().mean()
        
        self.log("val_loss", loss, prog_bar=True)
        self.log("val_acc", acc, prog_bar=True)
        
        # Логируем предсказания в TensorBoard
        if batch_idx == 0:
            for i in range(min(5, len(batch["label"]))):
                self.logger.experiment.add_text(
                    "val_predictions",
                    f'''Text: {batch['text'][i]}; 
                    Prediction: {outputs.argmax(1)[i].item()}; 
                    True Label: {batch['label'][i].item()}''',
                    self.global_step
                )

    def configure_optimizers(self):
        optimizer = AdamW(self.parameters(), lr=self.hparams.learning_rate)
        return optimizer

In [5]:
@dataclass
class CFG:
    project_name: str = "TextClassification"
    task_name: str = "Fine-tune DistilBERT"
    batch_size: int = 32
    learning_rate: float = 2e-5
    seed: int = 2024
    device: str = 'cuda'  #"cuda"
    epochs: int = 3

# Чтобы сохранить конфигурацию текущего эксперимента, перенесём её в словарь
cfg = CFG()
configuration_dict = asdict(cfg)
configuration_dict

{'project_name': 'TextClassification',
 'task_name': 'Fine-tune DistilBERT',
 'batch_size': 32,
 'learning_rate': 2e-05,
 'seed': 2024,
 'device': 'cuda',
 'epochs': 3}

In [6]:
# Инициализация задачи в ClearML
task = Task.init(project_name=cfg.project_name, 
                 task_name=cfg.task_name, 
                 auto_connect_frameworks={'tensorboard': True})

ClearML Task: created new task id=60fc42b4e08a4532af10aa834f9cef9a
2025-02-12 23:40:48,025 - clearml.Task - INFO - Storing jupyter notebook directly as code
ClearML results page: https://app.clear.ml/projects/651ec2987cb34c0c9613bda5583f55e9/experiments/60fc42b4e08a4532af10aa834f9cef9a/output/log
ClearML Monitor: Could not detect iteration reporting, falling back to iterations as seconds-from-start


In [7]:
data_module = IMDBDataModule(batch_size=cfg.batch_size)
data_module.prepare_data()
data_module.setup()

# Инициализация модели
model = DistilBertClassifier(learning_rate=cfg.learning_rate)

Map:   0%|          | 0/25000 [00:00<?, ? examples/s]

Map:   0%|          | 0/25000 [00:00<?, ? examples/s]

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [8]:
logger = TensorBoardLogger("lightning_logs", name="distilbert_imdb")
trainer = Trainer(max_epochs=cfg.epochs, 
                  accelerator=cfg.device, 
                  logger=logger)

Trainer will use only 1 of 2 GPUs because it is running inside an interactive / notebook environment. You may try to set `Trainer(devices=2)` but please note that multi-GPU inside interactive / notebook environments is considered experimental and unstable. Your mileage may vary.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs


In [9]:
trainer.fit(model, datamodule=data_module)

You are using a CUDA device ('NVIDIA GeForce RTX 3090') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision


Map:   0%|          | 0/25000 [00:00<?, ? examples/s]

Map:   0%|          | 0/25000 [00:00<?, ? examples/s]

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]



  | Name    | Type                                | Params | Mode 
------------------------------------------------------------------------
0 | model   | DistilBertForSequenceClassification | 67.0 M | eval 
1 | loss_fn | CrossEntropyLoss                    | 0      | train
------------------------------------------------------------------------
67.0 M    Trainable params
0         Non-trainable params
67.0 M    Total params
267.820   Total estimated model params size (MB)
1         Modules in train mode
96        Modules in eval mode


Sanity Checking: |          | 0/? [00:00<?, ?it/s]

Training: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

`Trainer.fit` stopped: `max_epochs=3` reached.


<div class="alert alert-info"> 

Проверим инференс модели на случайной фразе!

In [47]:
def predict(text, model, tokenizer):
    model.eval()
    with torch.no_grad():
        tokens = tokenizer(text, return_tensors="pt", truncation=True, padding="max_length", max_length=512)
        output = model(tokens["input_ids"], tokens["attention_mask"])
        prediction = output.argmax(1).item()
        return "Positive" if prediction == 1 else "Negative"

In [48]:
sample_text = "This movie was absolutely fantastic! The story, the acting, everything was perfect."
predicted_label = predict(sample_text, model, data_module.tokenizer)

print(f"Text: {sample_text}")
print(f"Predicted Sentiment: {predicted_label}")

Text: This movie was absolutely fantastic! The story, the acting, everything was perfect.
Predicted Sentiment: Positive


In [10]:
task.close()

<div class="alert alert-success">
    
Можем перейти в `WebUI ClearML` и посмотреть на логи и `debug samples`.

<img src='../images/debtr.png'>

## <center id="p6"> 🧸 Выводы и заключения ✅

<div class="alert alert-success">
    
В уроке рассмотрели способы как связка `Lightning + ClearML` поможет ускорить решение и отладку NLP-задач:
* Попробовали 2 способа логирования отладочных сэмплов
* Применили связку в задаче файнтюнинга трансформера
* В практическом задании попрактикуетесь в файнтюнинге LLM!

<div class="alert alert-info">
    
Связка `Lightning + ClearML` делает разработку NLP-моделей быстрее и удобнее:

* `Lightning` = упрощённое обучение
* `ClearML` = мониторинг и автоматизация

💡 Если вы работаете с NLP и PyTorch — этот стек ускорит ваш пайплайн и упростит отладку! 🚀