## Practical assignment

1. Take a pretrained model from https://huggingface.co/models for text sentiment classification.
2. Make predictions on the whole of df_val and compute a quality metric.
3. Fine-tune this model on df_train and compute the quality metric on df_val.

Data on Google Drive: https://drive.google.com/file/d/1Mev_EEput0LlBj8MDHIJkBtahlJ6J901

In [None]:
!pip install transformers

import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F
from collections import Counter
from sklearn.metrics import accuracy_score
from torch.optim import Adam
from tqdm import tqdm
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline



## **Download data**

In [None]:
# download the data
!wget 'https://drive.google.com/uc?export=download&id=1Mev_EEput0LlBj8MDHIJkBtahlJ6J901' -O data.zip


In [None]:
# unpack the data
!unzip data.zip

Archive:  data.zip
replace train.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: n
replace val.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: n


In [None]:
# read the training and validation sets
df_train = pd.read_csv('train.csv')
df_val = pd.read_csv('val.csv')

df_train.shape, df_val.shape

((181467, 3), (22683, 3))

In [None]:
df_train.head()

Unnamed: 0,id,text,class
0,0,@alisachachka не уезжаааааааай. :(❤ я тоже не ...,0
1,1,RT @GalyginVadim: Ребята и девчата!\nВсе в кин...,1
2,2,RT @ARTEM_KLYUSHIN: Кто ненавидит пробки ретви...,0
3,3,RT @epupybobv: Хочется котлету по-киевски. Зап...,1
4,4,@KarineKurganova @Yess__Boss босапопа есбоса н...,1


In [None]:
df_train['class'].value_counts()

1    92063
0    89404
Name: class, dtype: int64
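
The two classes are nearly balanced, so a trivial classifier that always predicts the majority class gives a useful floor for any accuracy figure. A minimal sketch using the training-set counts printed above (the variable names are illustrative):

```python
# Class counts copied from the value_counts() output above.
class_counts = {1: 92063, 0: 89404}

total = sum(class_counts.values())
majority_class = max(class_counts, key=class_counts.get)
baseline_accuracy = class_counts[majority_class] / total

print(f"total={total}, majority class={majority_class}, "
      f"baseline accuracy={baseline_accuracy:.4f}")
```

An accuracy close to 0.51 is therefore about what a constant prediction would achieve on the training set.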

## **Choose model**

In [None]:
idx = 11
print(df_train.iloc[idx]['text'])
print('label is', df_train.iloc[idx]['class'])

мартовские путёвки дорожают на глазах. только пару дней назад были за 66, уже 86 о_О
label is 0


In [None]:
# compare the predictions of several models on one example

model_list = ['Maha/xlmtwtroberta_label2',
              'sismetanin/rubert-toxic-pikabu-2ch',
              'IlyaGusev/rubertconv_toxic_clf']

for model_name in model_list:
    clf = pipeline('text-classification', model=model_name)
    result = clf(df_train.iloc[idx]['text'])
    print(result, model_name)

[{'label': 'LABEL_0', 'score': 0.8964649438858032}] Maha/xlmtwtroberta_label2
[{'label': 'LABEL_0', 'score': 0.9976280331611633}] sismetanin/rubert-toxic-pikabu-2ch
[{'label': 'neutral', 'score': 0.9995463490486145}] IlyaGusev/rubertconv_toxic_clf
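
The pipelines return label strings in each model's own naming scheme (`LABEL_0`, `neutral`), so they have to be mapped to the dataset's 0/1 classes before a metric can be computed. The sketch below shows one way to do that; the mapping table is an assumption made for illustration and should be checked against each model card, since these label names need not correspond to the sentiment classes of this dataset.

```python
# Hypothetical mapping from model label strings to the dataset's class ids.
# This is an assumption for illustration; verify it against the model card.
LABEL_TO_CLASS = {"LABEL_0": 0, "LABEL_1": 1}


def to_class_id(pipeline_output, label_to_class=LABEL_TO_CLASS):
    """Convert one pipeline result (a list with one dict) to a class id."""
    label = pipeline_output[0]["label"]
    if label not in label_to_class:
        raise KeyError(f"no class mapping defined for label {label!r}")
    return label_to_class[label]


# Output format copied from the cell above.
result = [{"label": "LABEL_0", "score": 0.8964649438858032}]
print(to_class_id(result))
```

Raising an error for unmapped labels such as `neutral` makes a missing mapping visible instead of silently miscounting predictions.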


In [None]:
# we choose the model IlyaGusev/rubertconv_toxic_clf

In [None]:
# convert the text to lower case
df_train['text'] = df_train['text'].str.lower()
df_val['text'] = df_val['text'].str.lower()

## **Tokenizer**

In [None]:
# define a custom dataset

class CustomDataset(torch.utils.data.Dataset):
    
    def __init__(self, txts, labels):      
        self._labels = labels
        self.tokenizer = AutoTokenizer.from_pretrained('IlyaGusev/rubertconv_toxic_clf')
        # tokenize every text up front; each entry holds tensors of shape [1, max_length]
        self._txts = [self.tokenizer(text, 
                                     padding='max_length', 
                                     max_length=10,
                                     truncation=True, 
                                     return_tensors='pt')
                      for text in txts]
        
    def __len__(self):
        return len(self._txts)
    
    def __getitem__(self, index):
        # returns (tokenizer output, label) for one example
        return self._txts[index], self._labels[index]
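
With `padding='max_length'`, `max_length=10` and `truncation=True`, every text becomes exactly 10 token ids, and the attention mask marks which positions hold real tokens. The sketch below imitates that behaviour with a toy whitespace tokenizer. `toy_encode` and its vocabulary are hypothetical and are not the Hugging Face tokenizer, which also adds special tokens and splits words into subwords.

```python
def toy_encode(text, vocab, max_length=10, pad_id=0, unk_id=1):
    """Toy stand-in for a tokenizer call with padding and truncation."""
    ids = [vocab.get(token, unk_id) for token in text.split()]
    ids = ids[:max_length]                      # truncation
    attention_mask = [1] * len(ids)
    padding = max_length - len(ids)
    ids += [pad_id] * padding                   # pad up to max_length
    attention_mask += [0] * padding             # padded positions are masked
    return {"input_ids": ids, "attention_mask": attention_mask}


vocab = {"hello": 2, "world": 3}
short = toy_encode("hello world", vocab)
long = toy_encode(" ".join(["hello"] * 25), vocab)
print(short)
print(len(long["input_ids"]), sum(long["attention_mask"]))
```

Texts longer than the limit lose everything past the tenth token, which may cap the accuracy that can be reached on longer tweets.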

In [None]:
# build the data loaders

y_train = df_train['class'].values
y_val = df_val['class'].values

train_dataset = CustomDataset(df_train['text'], y_train)
valid_dataset = CustomDataset(df_val['text'], y_val)

train_loader = torch.utils.data.DataLoader(train_dataset,
                                           batch_size=128,
                                           shuffle=True,
                                           num_workers=2)
valid_loader = torch.utils.data.DataLoader(valid_dataset,
                                           batch_size=128,
                                           shuffle=False,
                                           num_workers=1)

In [None]:
# look at the first batch

for txt, lbl in train_loader:
    print(txt.keys())
    print(txt['input_ids'].shape)
    break

dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])
torch.Size([128, 1, 10])
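
The extra middle dimension comes from tokenizing each text separately with `return_tensors='pt'`: every item has shape `[1, 10]`, and stacking 128 of them gives `[128, 1, 10]`. This is why later cells call `.squeeze(1)` on `input_ids`. A small sketch with random integers in place of real token ids:

```python
import torch

# Stand-ins for 128 tokenized items, each of shape [1, 10] as produced by
# a tokenizer call with return_tensors='pt' on a single text.
items = [torch.randint(0, 100, (1, 10)) for _ in range(128)]

batch = torch.stack(items)      # stacking is what default collation does for tensors
print(batch.shape)              # torch.Size([128, 1, 10])

flat = batch.squeeze(1)         # drop the singleton dimension
print(flat.shape)               # torch.Size([128, 10])
```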


In [None]:
# define the classifier

class AutoModelForSequenceClassifier(nn.Module):

    def __init__(self, dropout=0.5):
        super().__init__()
        self.bert = AutoModelForSequenceClassification.from_pretrained("IlyaGusev/rubertconv_toxic_clf", output_hidden_states=True)
        self.dropout = nn.Dropout(dropout)
        # 9984 = 13 * 768: all returned hidden states are concatenated along the feature axis
        self.linear = nn.Linear(9984, 2)
        self.sigm = nn.Sigmoid()

    def forward(self, x, mask):
        # with return_dict=False and no labels the model returns (logits, hidden_states)
        outputs = self.bert(input_ids=x, attention_mask=mask, return_dict=False)
        hidden_states = outputs[1]
        # concatenate the hidden states along the feature axis
        pooled_output = torch.cat(hidden_states, dim=-1)
        # keep the vector of the first ([CLS]) token
        pooled_output = pooled_output[:, 0, :]
        dropout_output = self.dropout(pooled_output)
        linear_output = self.linear(dropout_output)
        # note: nn.CrossEntropyLoss is normally given raw logits; the sigmoid is kept
        # here because the outputs recorded below were produced with it
        final_layer = self.sigm(linear_output)
        return final_layer
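
The forward pass concatenates all hidden states returned by the encoder along the feature axis and then keeps the vector at position 0. With 768-dimensional hidden states, an input size of 9984 corresponds to 13 hidden states. The sketch assumes these are the embedding output plus 12 encoder layers, and it uses random tensors in place of real model outputs.

```python
import torch

batch_size, seq_len, hidden_size = 4, 10, 768
num_hidden_states = 13  # assumption: embeddings + 12 encoder layers

# Random stand-ins for the tuple of hidden states returned by the model.
hidden_states = tuple(torch.randn(batch_size, seq_len, hidden_size)
                      for _ in range(num_hidden_states))

stacked = torch.cat(hidden_states, dim=-1)   # [4, 10, 13 * 768]
cls_features = stacked[:, 0, :]              # first-token vector per example

print(stacked.shape, cls_features.shape)
```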

In [None]:
# check the device
device = 'cuda' if torch.cuda.is_available() else 'cpu'
device

'cuda'

In [None]:
# initialize the model
model = AutoModelForSequenceClassifier().to(device)
criterion = nn.CrossEntropyLoss()
# optimizer = Adam(model.parameters(), lr=0.001)  # train all parameters
optimizer = Adam(model.linear.parameters(), lr=0.001)  # train only the linear head
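
`nn.CrossEntropyLoss` applies a log-softmax to its input, so it is usually fed raw logits. When the inputs have already been squashed into (0, 1) by a sigmoid, as in the classifier above, the two-class loss is confined to a narrow band around ln 2. A sketch in plain Python (the helper `cross_entropy_2class` is illustrative, not a library function):

```python
import math


def cross_entropy_2class(scores, target):
    """Cross-entropy of softmax(scores) against the target index."""
    log_norm = math.log(sum(math.exp(s) for s in scores))
    return log_norm - scores[target]


# Scores after a sigmoid lie strictly between 0 and 1, so the best and worst
# cases are approached when the two scores approach (1, 0) or (0, 1).
best = cross_entropy_2class([1.0, 0.0], target=0)
worst = cross_entropy_2class([0.0, 1.0], target=0)
chance = cross_entropy_2class([0.5, 0.5], target=0)

print(f"best ~ {best:.4f}, chance = {chance:.4f}, worst ~ {worst:.4f}")
```

Passing the linear layer's output directly to the loss would remove this floor of about 0.31; whether that improves accuracy here has not been tested in this notebook.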

In [None]:
# inspect the model architecture
print(model)
print("Parameters full train:", sum([param.nelement() for param in model.parameters()]))
print("Parameters transfer learning:", sum([param.nelement() for param in model.linear.parameters()]))

AutoModelForSequenceClassifier(
  (bert): BertForSequenceClassification(
    (bert): BertModel(
      (embeddings): BertEmbeddings(
        (word_embeddings): Embedding(119547, 768, padding_idx=0)
        (position_embeddings): Embedding(512, 768)
        (token_type_embeddings): Embedding(2, 768)
        (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        (dropout): Dropout(p=0.1, inplace=False)
      )
      (encoder): BertEncoder(
        (layer): ModuleList(
          (0): BertLayer(
            (attention): BertAttention(
              (self): BertSelfAttention(
                (query): Linear(in_features=768, out_features=768, bias=True)
                (key): Linear(in_features=768, out_features=768, bias=True)
                (value): Linear(in_features=768, out_features=768, bias=True)
                (dropout): Dropout(p=0.1, inplace=False)
              )
              (output): BertSelfOutput(
                (dense): Linear(in_features=768, out_featu
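
The second count printed by this cell covers only the linear head, which is the only part the optimizer updates. Its size follows from the layer definition `nn.Linear(9984, 2)`: one weight per input-output pair plus one bias per output. A quick arithmetic check:

```python
# Sizes taken from nn.Linear(9984, 2) in the classifier definition.
in_features, out_features = 9984, 2

weight_params = in_features * out_features
bias_params = out_features
head_params = weight_params + bias_params

print(head_params)  # 19970
```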

In [None]:
# make predictions on the first batch of the validation data
model.eval()
test_samples, test_labels = next(iter(valid_loader))
test_samples = test_samples.to(device)
test_labels = test_labels.to(device)

with torch.no_grad():  # no gradients are needed for inference
    predictions = model(test_samples['input_ids'].squeeze(1), test_samples['attention_mask'])

test_true_labels = test_labels.cpu().numpy()
predictions_labels = predictions.argmax(dim=1).cpu().numpy()

print(test_true_labels[:15])
print(predictions_labels[:15])

[1 0 0 0 0 0 1 0 0 1 0 1 1 1 0]
[1 1 0 0 1 1 1 1 1 1 1 1 1 0 1]


In [None]:
# compute the metric on this batch
accuracy_score(test_true_labels, predictions_labels)

0.5078125
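
The figure above comes from the first validation batch only (128 examples). To score the whole validation set, the correct predictions can be accumulated batch by batch. Below is a framework-free sketch; `batches` is a hypothetical iterable of `(predicted_labels, true_labels)` pairs, which in the notebook would come from looping over `valid_loader`.

```python
def accuracy_over_batches(batches):
    """Accuracy over all examples in an iterable of (preds, labels) pairs."""
    correct = 0
    total = 0
    for preds, labels in batches:
        correct += sum(int(p == y) for p, y in zip(preds, labels))
        total += len(labels)
    if total == 0:
        raise ValueError("no examples to evaluate")
    return correct / total


# Toy batches of unequal size, as happens with the last batch of a loader.
batches = [([1, 0, 1], [1, 1, 1]), ([0, 0], [0, 1])]
print(accuracy_over_batches(batches))  # 0.6
```

Counting correct predictions and dividing once at the end avoids over-weighting a smaller final batch, which averaging per-batch accuracies would do.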

In [None]:
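# train for five epochs
# note: the printed losses are sums of per-batch mean losses divided by the
# dataset size, i.e. roughly the mean batch loss divided by the batch size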

for epoch_num in range(5):
    total_acc_train = 0
    total_loss_train = 0

    model.train()
    for train_input, train_label in tqdm(train_loader):
        mask = train_input['attention_mask'].to(device)
        input_id = train_input['input_ids'].squeeze(1).to(device)
        train_label = train_label.to(device)

        output = model(input_id, mask)
                
        batch_loss = criterion(output, train_label)
        total_loss_train += batch_loss.item()
                
        acc = (output.argmax(dim=1) == train_label).sum().item()
        total_acc_train += acc

        model.zero_grad()
        batch_loss.backward()
        optimizer.step()
            
    model.eval()
    total_loss_val, total_acc_val = 0.0, 0.0
    with torch.no_grad():  # no gradients are needed for validation
        for val_input, val_label in valid_loader:
            val_label = val_label.to(device)
            mask = val_input['attention_mask'].to(device)
            input_id = val_input['input_ids'].squeeze(1).to(device)

            output = model(input_id, mask)

            batch_loss = criterion(output, val_label)
            total_loss_val += batch_loss.item()

            acc = (output.argmax(dim=1) == val_label).sum().item()
            total_acc_val += acc
            
    print(
        f'Epochs: {epoch_num + 1} | Train Loss: {total_loss_train / len(train_dataset): .3f} \
        | Train Accuracy: {total_acc_train / len(train_dataset): .3f} \
        | Val Loss: {total_loss_val / len(valid_dataset): .3f} \
        | Val Accuracy: {total_acc_val / len(valid_dataset): .3f}')

100%|██████████| 1418/1418 [04:58<00:00,  4.75it/s]


Epochs: 1 | Train Loss:  0.005         | Train Accuracy:  0.631         | Val Loss:  0.005         | Val Accuracy:  0.644


100%|██████████| 1418/1418 [04:58<00:00,  4.75it/s]


Epochs: 2 | Train Loss:  0.005         | Train Accuracy:  0.640         | Val Loss:  0.005         | Val Accuracy:  0.645


100%|██████████| 1418/1418 [04:58<00:00,  4.76it/s]


Epochs: 3 | Train Loss:  0.005         | Train Accuracy:  0.639         | Val Loss:  0.005         | Val Accuracy:  0.643


100%|██████████| 1418/1418 [04:58<00:00,  4.75it/s]


Epochs: 4 | Train Loss:  0.005         | Train Accuracy:  0.641         | Val Loss:  0.005         | Val Accuracy:  0.647


100%|██████████| 1418/1418 [04:58<00:00,  4.76it/s]


Epochs: 5 | Train Loss:  0.005         | Train Accuracy:  0.641         | Val Loss:  0.005         | Val Accuracy:  0.649
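
The losses printed above are sums of per-batch mean losses divided by the number of examples, so they are smaller than the average per-batch loss by roughly the batch size. A back-of-the-envelope conversion using numbers from this notebook (1418 training batches, 181467 training examples):

```python
num_batches = 1418        # from the tqdm progress bar
num_examples = 181467     # from df_train.shape
reported_loss = 0.005     # printed train loss, rounded to 3 decimals

# reported = sum(batch means) / num_examples
#          = mean_batch_loss * num_batches / num_examples
mean_batch_loss = reported_loss * num_examples / num_batches

print(f"approximate mean per-batch loss: {mean_batch_loss:.2f}")
```

Because the printed value is rounded to three decimals, this is only an estimate of the per-batch loss.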


In [None]:
# make predictions on the first batch of the validation data
model.eval()
test_samples, test_labels = next(iter(valid_loader))
test_samples = test_samples.to(device)
test_labels = test_labels.to(device)

with torch.no_grad():  # no gradients are needed for inference
    predictions = model(test_samples['input_ids'].squeeze(1), test_samples['attention_mask'])

test_true_labels = test_labels.cpu().numpy()
predictions_labels = predictions.argmax(dim=1).cpu().numpy()

print(test_true_labels[:15])
print(predictions_labels[:15])

[1 0 0 0 0 0 1 0 0 1 0 1 1 1 0]
[1 0 0 1 0 1 1 1 0 1 1 1 0 1 0]


In [None]:
# compute the metric on this batch
accuracy_score(test_true_labels, predictions_labels)

0.703125
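
An accuracy measured on a single batch of 128 examples is a noisy estimate. A rough binomial standard error, assuming independent examples, shows the scale of that noise:

```python
import math

n = 128              # examples in one validation batch
accuracy = 0.703125  # accuracy printed above

standard_error = math.sqrt(accuracy * (1 - accuracy) / n)
print(f"{accuracy:.3f} +/- {standard_error:.3f}")
```

The validation accuracy of 0.649 reported after the fifth epoch is computed over the whole validation set and is therefore the more reliable figure.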

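### **Conclusion:**

The model was fine-tuned by training only the linear head on top of the pretrained encoder. Accuracy on the first validation batch (128 examples) rose from 0.508 to 0.703, and accuracy on the whole validation set reached 0.649 after five epochs.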