# **TEXT CLASSIFICATION WITH BERT**

First, make sure that all the required libraries are installed.

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

In [None]:
!pip install transformers datasets evaluate accelerate

**Loading the dataset**

For the analysis we use the IMDb dataset of movie reviews, labeled "0" for negative reviews and "1" for positive ones.

In [None]:
from datasets import load_dataset

imdb = load_dataset("imdb")

In [None]:
imdb["test"][0]

In [None]:
imdb.shape

**Data preprocessing**

We use the DistilBERT tokenizer and truncate the text so that input sequences do not exceed the maximum length the model accepts.

In [3]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

In [7]:
def preprocess_function(examples):
    # truncation=True cuts each review down to the model's maximum input length
    return tokenizer(examples["text"], truncation=True)
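
To make the effect of `truncation=True` concrete, here is a minimal sketch in plain Python that does not call the real tokenizer. The helper `encode_with_truncation` and the toy limit of 8 are hypothetical. The ids 101 and 102 are assumed to be `[CLS]` and `[SEP]` in the uncased BERT vocabulary, and the real limit for `distilbert-base-uncased` is 512 tokens, special tokens included.

```python
CLS, SEP = 101, 102  # assumed ids of [CLS] and [SEP]; used here for illustration only


def encode_with_truncation(content_ids, max_length):
    """Wrap content ids in [CLS] ... [SEP], dropping the tail that does not fit."""
    room = max_length - 2  # two positions are reserved for the special tokens
    return [CLS] + content_ids[:room] + [SEP]


MAX_LENGTH = 8  # toy limit so the effect is easy to see

short_review = list(range(1000, 1005))  # 5 content tokens, fits as is
long_review = list(range(1000, 1020))   # 20 content tokens, must be cut

encoded_short = encode_with_truncation(short_review, MAX_LENGTH)
encoded_long = encode_with_truncation(long_review, MAX_LENGTH)

print(len(encoded_short), len(encoded_long))
```

The short review keeps all of its tokens, while the long one loses everything past the limit, so very long reviews are classified from their opening part only.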

To apply the preprocessing function to the whole dataset, we use the [map](https://huggingface.co/docs/datasets/v2.16.1/en/package_reference/main_classes#datasets.Dataset.map) method, which can be sped up by processing the data in batches (`batched=True`).

In [5]:
tokenized_imdb = imdb.map(preprocess_function, batched=True)

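With `batched=True` the preprocessing function receives a dictionary of lists (one list per column, covering a slice of rows) instead of a single example. The sketch below imitates that contract in plain Python. `map_batched` and `toy_preprocess` are hypothetical stand-ins, not the `datasets` implementation, and a whitespace split replaces the tokenizer.

```python
def map_batched(columns, function, batch_size):
    """Apply `function` to slices of a column-oriented table and add the new columns."""
    n_rows = len(next(iter(columns.values())))
    new_columns = {}
    for start in range(0, n_rows, batch_size):
        # Each batch is a dict of lists, one list per column
        batch = {name: values[start:start + batch_size] for name, values in columns.items()}
        for name, values in function(batch).items():
            new_columns.setdefault(name, []).extend(values)
    return {**columns, **new_columns}


def toy_preprocess(examples):
    # Works on a whole list of texts at once, like preprocess_function does
    return {"n_words": [len(text.split()) for text in examples["text"]]}


toy = {"text": ["great film", "boring", "one of the best"], "label": [1, 0, 1]}
mapped = map_batched(toy, toy_preprocess, batch_size=2)
print(mapped["n_words"])
```

Processing a list of texts per call is what lets a fast tokenizer work on many reviews at once instead of paying the per-call overhead for each one.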

We use the [DataCollatorWithPadding](https://huggingface.co/docs/transformers/v4.37.0/en/main_classes/data_collator#transformers.DataCollatorWithPadding) class to build batches efficiently.
Dynamic padding saves memory: once the examples for a batch have been selected, the shorter ones are padded only up to the length of the longest example in the CURRENT batch, not up to the longest example in the whole dataset.
A visualization is available [here](https://plainenglish.io/blog/understanding-collate-fn-in-pytorch-f9d1742647d3).

In [6]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
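
The saving from dynamic padding can be estimated with a small sketch in plain Python. The token counts are made up, `pad_batch` is a hypothetical helper rather than the collator itself, and the pad id 0 is an assumption used only for illustration.

```python
def pad_batch(batch, pad_id=0, length=None):
    """Pad every sequence to `length`, or to the longest sequence in this batch."""
    target = length if length is not None else max(len(seq) for seq in batch)
    return [seq + [pad_id] * (target - len(seq)) for seq in batch]


lengths = [12, 30, 45, 500]                  # hypothetical token counts of four reviews
sequences = [[1] * n for n in lengths]
batches = [sequences[0:2], sequences[2:4]]   # two batches of two examples each

# Dynamic padding: each batch is padded to its own longest example
dynamic_cells = sum(len(row) for batch in batches for row in pad_batch(batch))

# Fixed padding: every example is padded to the model maximum of 512
fixed_cells = sum(len(row) for batch in batches for row in pad_batch(batch, length=512))

print(dynamic_cells, fixed_cells)
```

In this toy case the batch of short reviews occupies 60 positions instead of 1024, and only the batch containing the 500-token review stays large.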

**EVALUATING THE RESULTS**

We load the metric from the `evaluate` library; for this task we need accuracy.

In [7]:
import evaluate

accuracy = evaluate.load("accuracy")

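We pass the metric function to the `Trainer`, which calls it at every evaluation pass. It receives the model's raw outputs (logits) together with the true labels, converts the logits into predicted classes with `argmax`, and compares them with the labels from the dataset.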

In [8]:
import numpy as np


def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)
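
What `compute_metrics` does can be traced by hand on a few made-up logits. This is a sketch in plain Python that mirrors `np.argmax(..., axis=1)` followed by an accuracy count. The helpers `argmax_rows` and `accuracy_score` and all the numbers are hypothetical.

```python
def argmax_rows(logits):
    """Index of the largest value in each row, like np.argmax(logits, axis=1)."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


def accuracy_score(predictions, references):
    """Share of predictions that match the reference labels."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


# One row per review: [logit for class 0, logit for class 1]
toy_logits = [[2.1, -1.3], [-0.4, 0.9], [0.2, 0.1], [-1.0, 3.0]]
toy_labels = [0, 1, 1, 1]

toy_predictions = argmax_rows(toy_logits)
toy_accuracy = accuracy_score(toy_predictions, toy_labels)

print(toy_predictions, toy_accuracy)
```

The third review is misclassified because its logit for class 0 is slightly larger, so three of the four predictions match.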

**TRAINING THE MODEL**

First, define the mappings between label ids and label names.

In [9]:
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
label2id = {"NEGATIVE": 0, "POSITIVE": 1}

In [10]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2, id2label=id2label, label2id=label2id
)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'classifier.weight', 'classifier.bias', 'pre_classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


We store the training hyperparameters in the `training_args` variable and pass this configuration to a `Trainer` object. Calling `train()` then starts fine-tuning DistilBERT.



In [11]:
training_args = TrainingArguments(
    output_dir="my_model",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_imdb["train"],
    eval_dataset=tokenized_imdb["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()

You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 0.2247 | 0.228536 | 0.91724 |
| 2 | 0.1527 | 0.223055 | 0.93252 |
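
With `load_best_model_at_end=True` and no `metric_for_best_model` set, the `Trainer` is expected to compare checkpoints by evaluation loss, where lower is better. The sketch below applies that rule to the two epochs reported above. The `eval_history` list is a hand-written copy of those numbers, not an object returned by the library.

```python
# Per-epoch evaluation results, copied by hand from the training log
eval_history = [
    {"epoch": 1, "eval_loss": 0.228536, "eval_accuracy": 0.91724},
    {"epoch": 2, "eval_loss": 0.223055, "eval_accuracy": 0.93252},
]

# Assumed default selection rule: the checkpoint with the lowest evaluation loss wins
best = min(eval_history, key=lambda record: record["eval_loss"])

print(best["epoch"])
```

Here the second epoch is best by both loss and accuracy, so the choice of metric does not change which checkpoint is kept.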


TrainOutput(global_step=3126, training_loss=0.20557194829978626, metrics={'train_runtime': 3271.7929, 'train_samples_per_second': 15.282, 'train_steps_per_second': 0.955, 'total_flos': 6564686875195392.0, 'train_loss': 0.20557194829978626, 'epoch': 2.0})
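
The `global_step` value in this output can be checked with simple arithmetic. The sketch assumes a single device and no gradient accumulation, so one optimizer step corresponds to one batch of 16 examples; the variable names are illustrative.

```python
import math

train_examples = 25000   # size of the IMDb train split
batch_size = 16          # per_device_train_batch_size
epochs = 2               # num_train_epochs

# The last batch of each epoch is smaller but still counts as a step
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs

train_runtime = 3271.7929  # seconds, taken from the TrainOutput above
steps_per_second = total_steps / train_runtime

print(steps_per_epoch, total_steps, round(steps_per_second, 3))
```

The result agrees with `global_step=3126`, with the `checkpoint-3126` directory name, and with the reported `train_steps_per_second` of 0.955.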

After fine-tuning, we can try the model on new text.
For this we use the `pipeline` function.

In [12]:
text = "This was a masterpiece. Not completely faithful to the books, but enthralling from beginning to end. Might be my favorite of the three."

In [14]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis", model="/content/my_model/checkpoint-3126")
classifier(text)

[{'label': 'POSITIVE', 'score': 0.9973165392875671}]
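
The `score` in this output is a probability. For a two-class model the pipeline is expected to apply a softmax to the logits and report the most likely class. Below is a minimal sketch of that post-processing in plain Python; the logits are made up and are not the actual outputs of the fine-tuned model.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the maximum keeps exp() numerically stable
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


id2label = {0: "NEGATIVE", 1: "POSITIVE"}

toy_logits = [-2.8, 3.1]  # hypothetical logits for one review
probs = softmax(toy_logits)
best_id = max(range(len(probs)), key=probs.__getitem__)
result = {"label": id2label[best_id], "score": probs[best_id]}

print(result)
```

This is also where the `id2label` mapping passed to the model pays off: the class index is reported under a readable name.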

# **Fine-tuning RoBERTa**

In [27]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

In [28]:
roberta_tokenized_imdb = imdb.map(preprocess_function, batched=True)

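Note that `preprocess_function` is reused here without changes. It picks up the RoBERTa tokenizer only because Python resolves the global name `tokenizer` when the function is called, not when it is defined. The sketch below shows this behavior with toy stand-ins (`str.lower` and `str.upper` play the role of two tokenizers) and an explicit alternative, `make_preprocess`, which is a hypothetical helper.

```python
def preprocess(examples):
    # `tokenizer` is looked up in the global scope at call time
    return [tokenizer(text) for text in examples]


tokenizer = str.lower                 # stand-in for the first tokenizer
first = preprocess(["Great Film"])

tokenizer = str.upper                 # rebinding the name changes what preprocess uses
second = preprocess(["Great Film"])


def make_preprocess(tok):
    """Bind a specific tokenizer to the function so later rebinding cannot affect it."""
    def bound_preprocess(examples):
        return [tok(text) for text in examples]
    return bound_preprocess


lower_preprocess = make_preprocess(str.lower)
third = lower_preprocess(["Great Film"])  # unaffected by the global `tokenizer`

print(first, second, third)
```

Relying on the global name is convenient in a notebook, but running cells out of order can silently tokenize with the wrong model's vocabulary, which is why binding the tokenizer explicitly is the safer design.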

In [29]:
from transformers import DataCollatorWithPadding

roberta_data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

In [30]:
import evaluate

accuracy = evaluate.load("accuracy")

In [31]:
import numpy as np


def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)

In [32]:
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
label2id = {"NEGATIVE": 0, "POSITIVE": 1}

In [33]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=2, id2label=id2label, label2id=label2id
)

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.weight', 'classifier.out_proj.weight', 'classifier.out_proj.bias', 'classifier.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
training_args = TrainingArguments(
    output_dir="/content/gdrive/MyDrive/my_model_roberta",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=roberta_tokenized_imdb["train"],
    eval_dataset=roberta_tokenized_imdb["test"],
    tokenizer=tokenizer,
    data_collator=roberta_data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()