In [None]:
pip install transformers

In [None]:
pip install sentencepiece

## 1. Pipelines

In [14]:
from transformers import pipeline
import torch

[Документация по transformers.pipeline](https://huggingface.co/transformers/main_classes/pipelines.html)

[Model hub](https://huggingface.co/models)

1.1 Среди предобученных моделей найдите модель для перевода текста с русского языка на английский. Протестируйте данную модель на нескольких предложениях, используя `transformers.pipeline`. Выведите результаты работы в следующем виде:

```
sentence1_ru -> sentence1_en
sentence2_ru -> sentence2_en
```


In [2]:
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-ru-en")

Downloading (…)lve/main/config.json:   0%|          | 0.00/1.38k [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/307M [00:00<?, ?B/s]

Downloading (…)neration_config.json:   0%|          | 0.00/293 [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/42.0 [00:00<?, ?B/s]

Downloading (…)olve/main/source.spm:   0%|          | 0.00/1.08M [00:00<?, ?B/s]

Downloading (…)olve/main/target.spm:   0%|          | 0.00/803k [00:00<?, ?B/s]

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/2.60M [00:00<?, ?B/s]



In [3]:
sentences_ru = [
    "Привет, как дела?",
    "Я люблю гулять в парке.",
    "Какой сегодня день недели?",
]

for sentence_ru in sentences_ru:
    sentence_en = translator(sentence_ru, max_length=40)[0]["translation_text"]
    print(f"{sentence_ru} -> {sentence_en}")

Привет, как дела? -> Hey, how's it going?
Я люблю гулять в парке. -> I like to walk in the park.
Какой сегодня день недели? -> What day of the week is it?


1.2 Среди предобученных моделей найдите модель для поиска ответа в тексте. Протестируйте данную модель на нескольких предложениях, используя `transformers.pipeline`. Выведите на экран результаты в следующем виде:

```
Q: ...
A: ...
Q: ...
A: ...
```

In [5]:
qa = pipeline("question-answering", model="deepset/roberta-base-squad2", tokenizer="deepset/roberta-base-squad2")

context = """
The history of natural language processing (NLP) generally started in the 1950s, although work can be found from earlier periods. In 1950, Alan Turing published an article titled "Computing Machinery and Intelligence" which proposed what is now called the Turing test as a criterion of intelligence.
"""

questions = [
    "When did the history of NLP start?",
    "Who proposed the Turing test?",
    "What did Alan Turing publish in 1950?",
]

for question in questions:
    answer = qa(question=question, context=context)["answer"]
    print(f"Q: {question}")
    print(f"A: {answer}")

Q: When did the history of NLP start?
A: 1950s
Q: Who proposed the Turing test?
A: Alan Turing
Q: What did Alan Turing publish in 1950?
A: Computing Machinery and Intelligence


1.3 Среди предобученных моделей найдите модель для классификации тональности русскоязычного текста (позитивный/негативный или позитивный/негативный/нейтральный). Протестируйте данную модель на нескольких предложениях, используя `transformers.pipeline`. Выведите результаты работы в следующем виде:

```
sentence1 -> class1
sentence2 -> class2
...
```

In [10]:
classifier = pipeline("sentiment-analysis", model="nlptown/bert-base-multilingual-uncased-sentiment")

sentences = [
    "Этот фильм был просто ужасен!",
    "Я очень доволен результатами своей работы.",
    "Мне очень не понравилось это место!",
    "Я не могу сказать, что мне нравится этот продукт.",
]

for sentence in sentences:
    result = classifier(sentence)[0]
    label = result["label"]
    score = result["score"]
    print(f"{sentence} -> {label} ({score:.2f})")

Этот фильм был просто ужасен! -> 1 star (0.40)
Я очень доволен результатами своей работы. -> 5 stars (0.61)
Мне очень не понравилось это место! -> 1 star (0.46)
Я не могу сказать, что мне нравится этот продукт. -> 3 stars (0.30)


## 2. Токенизаторы и модели

[Auto Classes](https://huggingface.co/transformers/model_doc/auto.html)

[Tokenizer](https://huggingface.co/transformers/main_classes/tokenizer.html?highlight=tokenizer#transformers.PreTrainedTokenizer.__call__)

2.1 Решите задачу 1.2, создав объект токенизатора (`transformers.AutoTokenizer`) и модель (`transformers.AutoModelForQuestionAnswering`).

In [15]:
from transformers import AutoTokenizer
from transformers import AutoModelForQuestionAnswering


tokenizer = AutoTokenizer.from_pretrained("deepset/roberta-base-squad2")
model = AutoModelForQuestionAnswering.from_pretrained("deepset/roberta-base-squad2")

context = """
The history of natural language processing (NLP) generally started in the 1950s, although work can be found from earlier periods. In 1950, Alan Turing published an article titled "Computing Machinery and Intelligence" which proposed what is now called the Turing test as a criterion of intelligence.
"""

questions = [
    "When did the history of NLP start?",
    "Who proposed the Turing test?",
    "What did Alan Turing publish in 1950?",
]

for question in questions:
    inputs = tokenizer(question, context, add_special_tokens=True, return_tensors="pt")
    outputs = model(**inputs)
    answer_start_scores = outputs.start_logits
    answer_end_scores = outputs.end_logits
    answer_start = torch.argmax(answer_start_scores)
    answer_end = torch.argmax(answer_end_scores) + 1
    answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0][answer_start:answer_end]))
    print(f"Q: {question}")
    print(f"A: {answer}")

Q: When did the history of NLP start?
A:  1950s
Q: Who proposed the Turing test?
A:  Alan Turing
Q: What did Alan Turing publish in 1950?
A: Computing Machinery and Intelligence


2.2 Решите задачу 1.3, создав объект токенизатора (`transformers.AutoTokenizer`) и модель (`transformers.AutoModelForSequenceClassification`).

In [19]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("nlptown/bert-base-multilingual-uncased-sentiment")
model = AutoModelForSequenceClassification.from_pretrained("nlptown/bert-base-multilingual-uncased-sentiment")

sentences = [
    "Этот фильм был просто ужасен!",
    "Я очень доволен результатами своей работы.",
    "Мне очень не понравилось это место!",
    "Я не могу сказать, что мне нравится этот продукт.",
]

encoded_sentences = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
outputs = model(**encoded_sentences)
predictions = outputs.logits.argmax(dim=1)

label_map = {0: "1 star", 1: "2 stars", 2: "3 stars", 3: "4 stars", 4: "5 stars"}
predicted_labels = [label_map[prediction.item()] for prediction in predictions]

print(predicted_labels)

['1 star', '5 stars', '1 star', '3 stars']


# 3. Fine tuning

3.1 Дообучите классификатор отзывов на основе модели `distilbert-base-uncased`.

Датасет: https://yadi.sk/d/mRXgc2aJSCncdw

* считайте данные, разбейте на обучающее и тестовое множество;
* создайте токенизатор `AutoTokenizer` для модели `distilbert-base-uncased` и преобразуйте с его помощью текстовые данные. Не забудьте выровнять длину всех последовательностей при помощи параметра `padding`;
* опишите класс `ReviewDataset`:
  * в данном случае удобнее, чтобы метод `__getitem__` возвращал словарь, а не кортеж (см. класс `MyDataset` ниже). Этот словарь должен содержать все данные, полученные после работы токенизатора плюс по ключу `label` должен находиться правильный ответ;
* создайте модель `AutoModelForSequenceClassification` с предобученными весами на основе `distilbert-base-uncased`;
  * при создании модели укажите параметр `num_labels=2`
* дообучите модель:
  * удобная особенность моделей из `transformers`: в метод `__call__` модели можно передать параметр `labels`, содержащий правильные ответы для обучения; тогда в словаре, который вернет метод `__call__` будет ключ `loss`, содержащий тензор со значением функции потерь, у которого можно вызвать метод `backward` и т.д. Таким образом, в данном случае функцию потерь объявлять не нужно;
  * для обучения используйте оптимизатор `transformers.AdamW` вместо `torch.optim.Adam`;
* измерьте значение accuracy на тестовом множестве.

In [7]:
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"

In [8]:
from sklearn.model_selection import train_test_split
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForSequenceClassification, AdamW, BertTokenizerFast, DistilBertTokenizer, AutoTokenizer

tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
# tokenizer = AutoTokenizer.from_pretrained("siebert/sentiment-roberta-large-english")

# model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")

In [1]:
import re
import nltk

import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
import torch.nn as nn
import torch.optim as optim

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from nltk.tokenize import word_tokenize
from sklearn.preprocessing import LabelEncoder
from torch.utils.data import Dataset, DataLoader
from nltk.tokenize import word_tokenize, sent_tokenize

nltk.download('punkt')

[nltk_data] Downloading package punkt to /home/noble6/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [2]:
with open("data/polarity/positive_reviews.txt") as f:
    positive_reviews = sent_tokenize(f.read())
    
with open("data/polarity/negative_reviews.txt") as f:
    negative_reviews = sent_tokenize(f.read())

In [3]:
reviews_df = pd.DataFrame()

reviews_df["text"] = positive_reviews + negative_reviews
reviews_df["category"] = [1 for i in range(len(positive_reviews))] + [0 for i in range(len(negative_reviews))]

reviews_df = reviews_df
reviews_df

Unnamed: 0,text,category
0,"simplistic , silly and tedious .",1
1,"it's so laddish and juvenile , only teenage bo...",1
2,exploitative and largely devoid of the depth o...,1
3,[garbus] discards the potential for pathologic...,1
4,a visually flashy but narratively opaque and e...,1
...,...,...
11872,may prove to be [tsai's] masterpiece .,0
11873,mazel tov to a film about a family's joyous li...,0
11874,standing in the shadows of motown is the best ...,0
11875,it's nice to see piscopo again after all these...,0


In [33]:
from nltk.stem import WordNetLemmatizer

def preprocess_text(text):
    text = text.lower()
    text = ''.join([' ' if not char.isalpha() and char not in ['.', ',', '!', '?'] else char for char in text])
    
    tokens = word_tokenize(text)
    lemmatizer = WordNetLemmatizer()
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens]
    
    preprocessed_text = ' '.join(lemmatized_tokens)
    
    return preprocessed_text

# reviews_df["text"] = reviews_df["text"].apply(lambda x: preprocess_text(x))
corpus = reviews_df["text"].apply(lambda x: preprocess_text(x))

In [5]:
import string
import spacy
from tqdm import tqdm
from multiprocessing import Pool
from spacy.lang.en.stop_words import STOP_WORDS as EN_STOP_WORDS


SPEECH_PARTS = ['NOUN', 'ADJ', 'VERB', 'ADV', 'PART', 'INTJ']
nlp = spacy.load("en_core_web_lg")

def preprocess_text(text):
    # text = " ".join([char for char in text if char not in string.punctuation])
    # text = " ".join(text).lower()
    # print(text)
    text = ''.join([' ' if not char.isalpha() and char not in string.punctuation else char for char in text])
    # text = ''.join([' ' if not char.isalpha() else char for char in text])
    # print(text)
    doc = nlp(text)
    lemmatized_tokens = []
    # word_vectors = []
    for token in doc:
        if token.is_alpha and token.text not in EN_STOP_WORDS:
            lemma = token.lemma_
            pos = token.pos_
            # Если токен - известное модели слово
            if pos in SPEECH_PARTS:
                lemmatized_tokens.append(lemma)
                # word_vector = [token.vector for token in doc]
                # word_vectors.append(word_vector)
                # lemmatized_tokens.append(lemma)
    
    preprocessed_text = ' '.join(lemmatized_tokens)
    
    return preprocessed_text


# corpus = corpus[:1000].apply(lambda x: preprocess_text(x))

def preprocess_corpus(corpus):
    with Pool(processes=8) as pool:
        results = []
        for result in tqdm(pool.imap_unordered(preprocess_text, corpus), total=len(corpus)):
            results.append(result)
    return results

corpus = preprocess_corpus(reviews_df["text"])

2023-05-31 23:22:50.283771: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-05-31 23:22:50.289964: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-05-31 23:22:50.290154: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
100%|██████████| 11877/11877 [00:22<00:00, 524.17it/s]


In [54]:
from sklearn.model_selection import train_test_split

X = corpus
y = reviews_df['category']
n_classes = y.nunique()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1, stratify=y)
X_train = X_train.reset_index(drop=True)
X_test = X_test.reset_index(drop=True)
y_train = y_train.reset_index(drop=True)
y_test = y_test.reset_index(drop=True)

In [4]:
from sklearn.model_selection import train_test_split

X = reviews_df['text']
y = reviews_df['category']
n_classes = y.nunique()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1, stratify=y)
X_train = X_train.reset_index(drop=True)
X_test = X_test.reset_index(drop=True)
y_train = y_train.reset_index(drop=True)
y_test = y_test.reset_index(drop=True)

In [5]:
from multiprocessing.pool import ThreadPool
from tqdm import tqdm

class TransDataset(Dataset):
    def __init__(self, text, labels, tokenizer):
        self.text = text
        self.labels = labels
        self.tokenizer = tokenizer
        t = []
        # Бьем текст на токены 
        with ThreadPool(8) as pool:
            for tokenized_input in tqdm(pool.imap_unordered(self.tokenize, text, chunksize=1000), total=len(text)):
                t.append(tokenized_input)
        self.text = t

    def tokenize(self, text):
        return self.tokenizer(text,
                        add_special_tokens=True,
                        max_length=60,
                        truncation=True,
                        padding='max_length',
                        return_attention_mask=True,
                        return_tensors="pt")
        
    def __len__(self):
        return len(self.labels)
    
    def __getitem__(self, idx):
        text = self.text
        return text[idx]['input_ids'].flatten(), text[idx]['attention_mask'].flatten(), torch.tensor(self.labels[idx], dtype=torch.long)
        # text = self.text[idx]
        # label = self.labels[idx]
        # tokenized_input = self.tokenize(text)
        # return {'input_ids': text[idx]['input_ids'].squeeze(0),
        #         'attention_mask': text[idx]['attention_mask'].squeeze(0),
        #         'label': torch.tensor(self.labels[idx], dtype=torch.long)}
        

In [None]:
class MyDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_len):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __getitem__(self, idx):
        text = self.texts[idx]
        label = self.labels[idx]
        encoding = self.tokenizer(
            text,
            add_special_tokens=True,
            max_length=self.max_len,
            padding='max_length',
            truncation=True,
            return_attention_mask=True,
            return_tensors='pt'
        )
        return {
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'labels': torch.tensor(label, dtype=torch.long)
        }

    def __len__(self):
        return len(self.labels)

In [9]:
train_dataset = TransDataset(X_train, y_train, tokenizer)
test_dataset = TransDataset(X_test, y_test, tokenizer)

100%|██████████| 9501/9501 [00:02<00:00, 3283.22it/s]
100%|██████████| 2376/2376 [00:00<00:00, 72260.53it/s]


In [12]:
train_dataloader = DataLoader(train_dataset, batch_size=32, shuffle=True, num_workers=8)
test_dataloader = DataLoader(test_dataset, batch_size=32, shuffle=False, num_workers=8)

In [13]:
next(iter(train_dataloader))[0][0]

tensor([ 101, 1037, 3142, 3775, 2278, 4038, 1010, 2748, 1010, 2021, 2028, 2007,
        3494, 2040, 2228, 1998, 2831, 2055, 2037, 3289, 1010, 1998, 2024, 2551,
        2006, 2524, 3247, 2015, 1012,  102,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0])

In [12]:
from sklearn.metrics import classification_report
import pytorch_lightning as pl




In [69]:

class DistilBertClassifier(pl.LightningModule):
    def __init__(self, model_name):
        super().__init__()
        # self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForSequenceClassification.from_pretrained(model_name)
        # self.model = BertForSequenceClassification(config)
        # for param in self.model.parameters():
        #     # print(param)
        #     param.requires_grad = False
        # print(self.model.classifier)
        # self.model.classifier.requires_grad = True
        for name, param in self.model.named_parameters():
            if 'classifier' not in name:  # Имя параметра содержит 'classifier'
                param.requires_grad = False
            
        # self.classifier = nn.Sequential(
        #     nn.Linear(768, 256),
        #     nn.BatchNorm1d(256),
        #     nn.ReLU(),
        #     nn.Linear(256, num_classes)
        # )
        # self.model.classifier = nn.Sequential(
        #     nn.Linear(120, 128),
        #     nn.BatchNorm1d(128),
        #     nn.ReLU(),
        #     nn.Linear(128, num_classes)
        # )
        # self.classifier = nn.Linear(768, num_classes)
        
    def forward(self, input_ids, attention_mask, label):
        print(attention_mask, label)
        outputs = self.model(input_ids, attention_mask, label)
        # print(outputs)
        
        # logits = self.classifier(outputs.logits)
        return outputs.logits
        
    def training_step(self, batch, batch_idx):
        # print(batch.shape)
        # print(batch)
        input_ids = batch[0] # input_ids
        attention_mask = batch[1] # attention_mask
        labels = batch[2] # labels
        
        # outputs = self(input_ids, attention_mask)
        # # logits = 
        # loss = nn.CrossEntropyLoss()(outputs.logits, labels)
        # # loss = outputs.loss
        
        # self.log('train_loss', loss)
        # return loss
        
        outputs = self(input_ids)
        # print(f"train_step: {outputs}")
        # loss = outputs.loss
        loss = nn.CrossEntropyLoss()(outputs, labels)
        loss.backward()
        self.log('train_loss', loss, prog_bar=True)
        return loss
        
    def validation_step(self, batch, batch_idx):
        # print(batch[0], batch[1], batch[2])
        # print(batch.shape)
        # print(batch)
        input_ids = batch[0] # input_ids
        attention_mask = batch[1] 
        labels = batch[2]
        
        # outputs = self(input_ids, attention_mask)
        
        
        # preds = torch.argmax(outputs.logits, dim=1)
        # acc = (preds == labels).float().mean()
        
        # self.log('val_loss', loss)
        # self.log('val_acc', acc)
        outputs = self(input_ids, attention_mask, labels)
        # loss = outputs.loss
        # logits = outputs.logits
        # print(f"val_step: {outputs}")
        loss = outputs.loss
        loss.backward()
        # loss = nn.CrossEntropyLoss()(outputs, labels)
        
        preds = torch.argmax(outputs, dim=1)
        acc = (preds == labels).float().mean()
        
        self.log('val_loss', loss, prog_bar=True)
        self.log('val_acc', acc, prog_bar=True)
        
    def test_step(self, batch, batch_idx):
        input_ids = batch[0]
        # attention_mask = batch[1]
        labels = batch[2]
        
        # outputs = self(input_ids, attention_mask)
        # logits = outputs.logits
        
        # preds = torch.argmax(logits, dim=1)
        # acc = (preds == labels).float().mean()
        
        # self.log('test_acc', acc)
        outputs = self(input_ids)
        logits = outputs
        
        preds = torch.argmax(logits, dim=1)
        acc = (preds == labels).float().mean()
        
        # вычисляем precision, recall и f1-score
        precision, recall, f1, support = classification_report(labels.cpu(), preds.cpu(), output_dict=True)["weighted avg"].values()
        
        self.log('test_acc', acc, prog_bar=True)
        self.log('test_precision', precision, prog_bar=True)
        self.log('test_recall', recall, prog_bar=True)
        self.log('test_f1', f1, prog_bar=True)
        
        
    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=0.003)
        return optimizer

In [13]:
class DistilBertClassifier(pl.LightningModule):
    def __init__(self, model_name, num_classes):
        super().__init__()
        self.model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=num_classes)
        # for name, param in self.model.named_parameters():
        #     if 'classifier' not in name:
        #         param.requires_grad = False
            
    def forward(self, input_ids, attention_mask):
        output = self.model(input_ids=input_ids, attention_mask=attention_mask)
        return output.logits
        
    def training_step(self, batch, batch_idx):
        input_ids, attention_mask, labels = batch
        output = self.model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
        loss = output.loss
        self.log('train_loss', loss, prog_bar=True)
        return loss
        
    def validation_step(self, batch, batch_idx):
        input_ids, attention_mask, labels = batch
        output = self.model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
        loss = output.loss
        preds = output.logits.argmax(dim=-1)
        acc = accuracy_score(preds.cpu(), labels.cpu())
        self.log('val_loss', loss, prog_bar=True)
        self.log('val_acc', acc, prog_bar=True)
        
    def test_step(self, batch, batch_idx):
        input_ids, attention_mask, labels = batch
        output = self.model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
        loss = output.loss
        preds = output.logits.argmax(dim=-1)
        acc = accuracy_score(preds, labels)
        precision, recall, f1, support = classification_report(labels.cpu(), preds.cpu(), output_dict=True)["weighted avg"].values()
        self.log('test_loss', loss, prog_bar=True)
        self.log('test_acc', acc, prog_bar=True)
        self.log('test_precision', precision, prog_bar=True)
        self.log('test_recall', recall, prog_bar=True)
        self.log('test_f1', f1, prog_bar=True)
        
    def configure_optimizers(self):
        optimizer = AdamW(self.parameters(), lr=0.001)
        return optimizer

In [14]:
model = DistilBertClassifier('distilbert-base-uncased', num_classes=2)


Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.weight', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_transform.weight', 'vocab_projector.bias', 'vocab_transform.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'classifier.bias', 'pre_classi

In [61]:
model = DistilBertClassifier("siebert/sentiment-roberta-large-english", num_classes=2)


In [25]:
model = BertClassifier('DeepPavlov/rubert-base-cased', num_classes=2)

Some weights of the model checkpoint at DeepPavlov/rubert-base-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.bias', 'bert.encoder.layer.4.attention.self.key.weight', 'bert.encoder.layer.3.attention.self.value.weight', 'bert.encoder.layer.5.attention.self.key.bias', 'bert.encoder.layer.7.output.LayerNorm.weight', 'bert.encoder.layer.4.output.dense.bias', 'bert.encoder.layer.7.attention.output.LayerNorm.bias', 'cls.predictions.bias', 'bert.encoder.layer.8.attention.self.value.weight', 'bert.encoder.layer.8.attention.self.key.weight', 'bert.encoder.layer.6.attention.self.key.weight', 'bert.encoder.layer.7.attention.self.key.bias', 'bert.encoder.layer.8.attention.output.dense.bias', 'bert.encoder.layer.9.attention.self.value.weight', 'bert.encoder.layer.11.output.LayerNorm.weight', 'bert.encoder.layer.7.output.LayerNorm.bias', 'bert.encoder.layer.4.attention.output.LayerNorm.weight', 'bert.encoder.layer.11.attention.output.LayerN

In [15]:
total_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Количество обучаемых параметров: {total_params}")

Количество обучаемых параметров: 66955010


In [16]:
trainer = pl.Trainer(max_epochs=4, accelerator="gpu")
trainer.fit(model, train_dataloader, test_dataloader)

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name  | Type                                | Params
--------------------------------------------------------------
0 | model | DistilBertForSequenceClassification | 67.0 M
--------------------------------------------------------------
67.0 M    Trainable params
0         Non-trainable params
67.0 M    Total params
267.820   Total estimated model params size (MB)


Sanity Checking: 0it [00:00, ?it/s]

Training: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

  rank_zero_warn("Detected KeyboardInterrupt, attempting graceful shutdown...")


In [48]:
trainer.test(model, test_dataloader)

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


Testing: 0it [00:00, ?it/s]

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_pr

[{'test_acc': 0.49074074625968933,
  'test_precision': 0.3865598440170288,
  'test_recall': 0.49074074625968933,
  'test_f1': 0.38008180260658264}]

In [19]:
import torch
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification, AdamW

# Загрузка предобученной модели и токенизатора
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=2)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_layer_norm.weight', 'vocab_projector.weight', 'vocab_transform.bias', 'vocab_projector.bias', 'vocab_transform.weight', 'vocab_layer_norm.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.bias', 'classifier.weight', 'pre_classi

In [20]:
for name, param in model.named_parameters():
    if 'classifier' not in name:  # Имя параметра содержит 'classifier'
        param.requires_grad = False

In [21]:
total_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Количество обучаемых параметров: {total_params}")

Количество обучаемых параметров: 592130


In [22]:
# Обучение модели
optimizer = AdamW(model.parameters(), lr=5e-5)
# train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=2, shuffle=True)
model.train()
model.cuda()
for epoch in range(3):
    for batch in tqdm(train_dataloader):
        optimizer.zero_grad()
        input_ids, attention_mask, labels = batch
        input_ids = input_ids.cuda()
        attention_mask = attention_mask.cuda()
        labels = labels.cuda()
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        loss.backward()
        optimizer.step()


100%|██████████| 297/297 [00:26<00:00, 11.41it/s]
100%|██████████| 297/297 [00:25<00:00, 11.55it/s]
100%|██████████| 297/297 [00:26<00:00, 11.38it/s]


In [23]:
        
model.eval()
with torch.no_grad():
    correct = 0
    total = 0
    for batch in test_dataloader:
        input_ids, attention_mask, labels = batch
        input_ids = input_ids.cuda()
        attention_mask = attention_mask.cuda()
        # labels = labels.cuda()
        outputs = model(input_ids, attention_mask=attention_mask)
        logits = outputs.logits
        predictions = torch.argmax(logits.detach().cpu(), dim=1)
        correct += (predictions == labels).sum().item()
        total += len(labels)
accuracy = correct / total
print('Accuracy:', accuracy)

Accuracy: 0.5050505050505051


In [33]:
class TextClassification(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
        self.model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=2)

    def forward(self, input_ids, attention_mask):
        return self.model(input_ids, attention_mask=attention_mask)

    def training_step(self, batch, batch_idx):
        input_ids, attention_mask, labels = batch
        output = self.model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
        loss = output.loss
        self.log('train_loss', loss, prog_bar=True)
        return loss

    def validation_step(self, batch, batch_idx):
        input_ids, attention_mask, labels = batch
        outputs = self(input_ids, attention_mask=attention_mask)
        logits = outputs.logits
        predictions = torch.argmax(logits, dim=1)
        correct = (predictions == labels).sum().item()
        total = len(labels)
        accuracy = correct / total
        self.log('val_accuracy', accuracy)
        return accuracy

    def configure_optimizers(self):
        optimizer = AdamW(self.parameters(), lr=5e-5)
        return optimizer

In [34]:
model = TextClassification()

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.bias', 'vocab_layer_norm.bias', 'vocab_layer_norm.weight', 'vocab_projector.bias', 'vocab_transform.weight', 'vocab_projector.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'pre_classifier.weight', 'pre_classi

In [35]:
total_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Количество обучаемых параметров: {total_params}")

Количество обучаемых параметров: 66955010


In [36]:
trainer = pl.Trainer(max_epochs=4, accelerator="gpu")
trainer.fit(model, train_dataloader, test_dataloader)

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name  | Type                                | Params
--------------------------------------------------------------
0 | model | DistilBertForSequenceClassification | 67.0 M
--------------------------------------------------------------
67.0 M    Trainable params
0         Non-trainable params
67.0 M    Total params
267.820   Total estimated model params size (MB)


Sanity Checking: 0it [00:00, ?it/s]

Training: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

  rank_zero_warn("Detected KeyboardInterrupt, attempting graceful shutdown...")
