<img src="https://s8.hostingkartinok.com/uploads/images/2018/08/308b49fcfbc619d629fe4604bceb67ac.jpg" width=500, height=450>
<h3 style="text-align: center;"><b>Физтех-Школа Прикладной математики и информатики (ФПМИ) МФТИ</b></h3>

---

# Задание 3

## Классификация текстов

В этом задании вам предстоит попробовать несколько методов, используемых в задаче классификации, а также понять насколько хорошо модель понимает смысл слов и какие слова в примере влияют на результат.

In [None]:
!pip install -U torchtext==0.8.0
!pip install -q captum

Requirement already up-to-date: torchtext==0.8.0 in /usr/local/lib/python3.7/dist-packages (0.8.0)


In [None]:
import pandas as pd
import numpy as np
import torch

from torchtext.datasets import IMDB
# from torchtext import datasets

from torchtext.data import Field, LabelField
from torchtext.data import BucketIterator

from torchtext.vocab import Vectors, GloVe

import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import random
from tqdm.autonotebook import tqdm

from sklearn.metrics import (
    f1_score, 
    confusion_matrix, 
    accuracy_score, 
    roc_auc_score,
    classification_report,
    balanced_accuracy_score
)

from captum.attr import LayerIntegratedGradients, TokenReferenceBase, visualization

В этом задании мы будем использовать библиотеку torchtext. Она довольна проста в использовании и поможет нам сконцентрироваться на задаче, а не на написании Dataloader-а.

In [56]:
accuracy_score

<function sklearn.metrics._classification.accuracy_score>

In [None]:
TEXT_rnn = Field(sequential=True, lower=True, include_lengths=True)  # Поле текста
LABEL_rnn = LabelField(dtype=torch.float)  # Поле метки



In [None]:
SEED = 1234

torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.random.manual_seed(SEED)
torch.cuda.random.manual_seed(SEED)
torch.cuda.random.manual_seed_all(SEED)

device = "cuda" if torch.cuda.is_available() else "cpu"

Датасет на котором мы будем проводить эксперементы это комментарии к фильмам из сайта IMDB.

In [None]:
train_rnn, test_rnn = IMDB.splits(TEXT_rnn, LABEL_rnn)  # загрузим датасет
train_rnn, valid_rnn = train_rnn.split(random_state=random.seed(SEED))  # разобьем на части



In [None]:
TEXT_rnn.build_vocab(train_rnn)
LABEL_rnn.build_vocab(train_rnn)

In [None]:
train_iter_rnn, valid_iter_rnn, test_iter_rnn = BucketIterator.splits(
    (train_rnn, valid_rnn, test_rnn), 
    batch_size = 64,
    sort_within_batch = True,
    device = device)



## RNN

Для начала попробуем использовать рекурентные нейронные сети. На семинаре вы познакомились с GRU, вы можете также попробовать LSTM. Можно использовать для классификации как hidden_state, так и output последнего токена.

In [None]:
from torch import nn 

In [None]:
class RNNBaseline(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, n_layers, 
                 bidirectional, dropout, pad_idx, num_classes=2):
        
        super().__init__()
        
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx = pad_idx)
        
        self.rnn = nn.LSTM(embedding_dim, hidden_dim, n_layers, dropout=dropout)
        
        self.fc = nn.Linear(hidden_dim*2, output_dim)

        self.drop = nn.Dropout(dropout)
        self.n_layers = n_layers
        
        self.relu = nn.ReLU()
        
    def forward(self, text, text_lengths):
        
        #text = [sent len, batch size]
        
        embedded = self.embedding(text)
        embedded = self.drop(embedded)
        
        packed_embedded = nn.utils.rnn.pack_padded_sequence(embedded, text_lengths)
      
        packed_output, (hidden, cell) = self.rnn(packed_embedded)

        hidden = self.relu(torch.cat([hidden[-1,:,:], hidden[-2,:,:]], 1))

        hidden = self.drop(hidden)
        fc = self.fc(hidden)
        return  fc

vocab_size = len(TEXT_rnn.vocab)
emb_dim = 300
hidden_dim = 256
output_dim = 1
n_layers = 3
bidirectional = True
dropout = 0.5
PAD_IDX = TEXT_rnn.vocab.stoi[TEXT_rnn.pad_token]
patience= 4
eps = 1e-3

model_rnn = RNNBaseline(
    vocab_size=vocab_size,
    embedding_dim=emb_dim,
    hidden_dim=hidden_dim,
    output_dim=output_dim,
    n_layers=n_layers,
    bidirectional=bidirectional,
    dropout=dropout,
    pad_idx=PAD_IDX
)
model_rnn = model_rnn.to(device)
opt = torch.optim.AdamW(model_rnn.parameters(), lr=eps)
loss_func_rnn = nn.BCEWithLogitsLoss()

max_epochs = 15

Поиграйтесь с гиперпараметрами

Обучите сетку! Используйте любые вам удобные инструменты, Catalyst, PyTorch Lightning или свои велосипеды.

In [None]:
import numpy as np

min_loss = np.inf

cur_patience = 0

best_model_rnn = model_rnn.state_dict()

for epoch in range(1, max_epochs + 1):
    train_loss = 0.0
    model_rnn = model_rnn.train()
    pbar = tqdm(enumerate(train_iter_rnn), total=len(train_iter_rnn), leave=False)
    pbar.set_description(f"Epoch {epoch}")
    for it, batch in pbar: 
        opt.zero_grad()
        #YOUR CODE GOES HERE
        # print(batch.text)
        text = batch.text
        target = batch.label.to(device)
        preds = model_rnn(text[0], text[1].cpu()).reshape(-1)
        loss = loss_func_rnn(preds, target)
        train_loss += loss.item() 

        loss.backward()
        opt.step()
        

    train_loss /= len(train_iter_rnn)
    val_loss = 0.0
    with torch.no_grad():
      pbar = tqdm(enumerate(valid_iter_rnn), total=len(valid_iter_rnn), leave=False)
      pbar.set_description(f"Epoch {epoch}")
      for it, batch in pbar:
          # YOUR CODE GOES HERE
          text = batch.text
          target = batch.label.to(device)
          preds = model_rnn(text[0], list(text[1])).reshape(-1)
          loss = loss_func_rnn(preds, target)
          val_loss += loss.item() 

      val_loss /= len(valid_iter_rnn)
    if val_loss < min_loss:
        min_loss = val_loss
        best_model_rnn = model_rnn.state_dict()
    else:
        cur_patience += 1
        if cur_patience == patience:
            cur_patience = 0
            break
    
    print('Epoch: {}, Training Loss: {}, Validation Loss: {}'.format(epoch, train_loss, val_loss))
model_rnn.load_state_dict(best_model_rnn)

HBox(children=(FloatProgress(value=0.0, max=274.0), HTML(value='')))





HBox(children=(FloatProgress(value=0.0, max=118.0), HTML(value='')))

Epoch: 1, Training Loss: 0.6865468040434983, Validation Loss: 0.6964398905382319


HBox(children=(FloatProgress(value=0.0, max=274.0), HTML(value='')))



HBox(children=(FloatProgress(value=0.0, max=118.0), HTML(value='')))

Epoch: 2, Training Loss: 0.6441184462857072, Validation Loss: 0.6275230742107003


HBox(children=(FloatProgress(value=0.0, max=274.0), HTML(value='')))



HBox(children=(FloatProgress(value=0.0, max=118.0), HTML(value='')))

Epoch: 3, Training Loss: 0.5824857880599308, Validation Loss: 0.6230631263579353


HBox(children=(FloatProgress(value=0.0, max=274.0), HTML(value='')))



HBox(children=(FloatProgress(value=0.0, max=118.0), HTML(value='')))

Epoch: 4, Training Loss: 0.49299002998936786, Validation Loss: 0.4896725090378422


HBox(children=(FloatProgress(value=0.0, max=274.0), HTML(value='')))



HBox(children=(FloatProgress(value=0.0, max=118.0), HTML(value='')))

Epoch: 5, Training Loss: 0.46381944657242213, Validation Loss: 0.5532356780969491


HBox(children=(FloatProgress(value=0.0, max=274.0), HTML(value='')))



HBox(children=(FloatProgress(value=0.0, max=118.0), HTML(value='')))

Epoch: 6, Training Loss: 0.40040823229908074, Validation Loss: 0.4919732076636815


HBox(children=(FloatProgress(value=0.0, max=274.0), HTML(value='')))



HBox(children=(FloatProgress(value=0.0, max=118.0), HTML(value='')))

Epoch: 7, Training Loss: 0.27177637434788865, Validation Loss: 0.428723109980761


HBox(children=(FloatProgress(value=0.0, max=274.0), HTML(value='')))



HBox(children=(FloatProgress(value=0.0, max=118.0), HTML(value='')))

Epoch: 8, Training Loss: 0.21267763680241403, Validation Loss: 0.40592230793278095


HBox(children=(FloatProgress(value=0.0, max=274.0), HTML(value='')))



HBox(children=(FloatProgress(value=0.0, max=118.0), HTML(value='')))

Epoch: 9, Training Loss: 0.16510119259248685, Validation Loss: 0.420420014504659


HBox(children=(FloatProgress(value=0.0, max=274.0), HTML(value='')))



HBox(children=(FloatProgress(value=0.0, max=118.0), HTML(value='')))



<All keys matched successfully>

Посчитайте f1-score вашего классификатора на тестовом датасете.

**Ответ**:

In [57]:
def testing(model, criterion, test_loader, device="cpu"):
  
  # pbar = tqdm(test_loader, desc=f"Test Loss: {0}, Test Acc: {0}")
  pbar = tqdm(enumerate(test_loader), total=len(test_loader), leave=False)
  mean_loss = 0
  mean_f1 = 0
  N_batches = 0
  model.eval()
  with torch.no_grad():
    for it, batch in pbar:
      # YOUR CODE GOES HERE
      text = batch.text
      target = batch.label.to(device)
      preds = model(text[0], list(text[1])).reshape(-1).to(device)
      loss = criterion(preds, target)
      # loss = loss.item()

      preds = torch.round(torch.sigmoid(preds))
      
      # f1 = f1_score(target, preds, average='weighted')
      f1 = accuracy_score(target, preds)

      mean_loss += loss.item()
      mean_f1 += f1
      N_batches = it

      pbar.set_description(f"Test Loss: {loss:.4}, Test F1: {f1:.4}")
  
  N_batches += 1
  pbar.set_description(f"Test Loss: {mean_loss / N_batches:.4}, Test F1: {mean_f1 / N_batches:.4}")

  return {"Test Loss": mean_loss / N_batches, "Test F1": mean_f1 / N_batches}

In [58]:
testing(model_rnn, loss_func_rnn, test_iter_rnn, device="cpu")

HBox(children=(FloatProgress(value=0.0, max=391.0), HTML(value='')))



{'Test F1': 0.8613331202046036, 'Test Loss': 0.3981961988846359}

## CNN

![](https://www.researchgate.net/publication/333752473/figure/fig1/AS:769346934673412@1560438011375/Standard-CNN-on-text-classification.png)

Для классификации текстов также часто используют сверточные нейронные сети. Идея в том, что как правило сентимент содержат словосочетания из двух-трех слов, например "очень хороший фильм" или "невероятная скука". Проходясь сверткой по этим словам мы получим какой-то большой скор и выхватим его с помощью MaxPool. Далее идет обычная полносвязная сетка. Важный момент: свертки применяются не последовательно, а параллельно. Давайте попробуем!

In [None]:
TEXT_base_cnn = Field(sequential=True, lower=True, batch_first=True)  # batch_first тк мы используем conv  
LABEL_base_cnn = LabelField(batch_first=True, dtype=torch.float)

train_base_cnn, tst_base_cnn = IMDB.splits(TEXT_base_cnn, LABEL_base_cnn)
trn_base_cnn, vld_base_cnn = train_base_cnn.split(random_state=random.seed(SEED))

TEXT_base_cnn.build_vocab(trn_base_cnn)
LABEL_base_cnn.build_vocab(trn_base_cnn)

device = "cuda" if torch.cuda.is_available() else "cpu"



In [None]:
train_iter_cnn, val_iter_cnn, test_iter_cnn = BucketIterator.splits(
        (trn_base_cnn, vld_base_cnn, tst_base_cnn),
        batch_sizes=(128, 256, 256),
        sort=False,
        sort_key= lambda x: len(x.src),
        sort_within_batch=False,
        device=device,
        repeat=False,
)



Вы можете использовать Conv2d с `in_channels=1, kernel_size=(kernel_sizes[0], emb_dim))` или Conv1d c `in_channels=emb_dim, kernel_size=kernel_size[0]`. Но хорошенько подумайте над shape в обоих случаях.

In [None]:
class CNN(nn.Module):
    def __init__(
        self,
        vocab_size,
        emb_dim,
        out_channels,
        kernel_sizes,
        dropout=0.5,
    ):
        super().__init__()
        
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        # in_channels, out_channels, kernel_size
        self.conv_0 = nn.Conv1d(emb_dim, out_channels, kernel_size=kernel_sizes[0], padding=1, stride=2,)  # YOUR CODE GOES HERE
        
        self.conv_1 = nn.Conv1d(emb_dim, out_channels, kernel_size=kernel_sizes[2], padding=1, stride=2,)  # YOUR CODE GOES HERE
        
        self.conv_2 = nn.Conv1d(emb_dim, out_channels, kernel_size=kernel_sizes[1], padding=1, stride=2,)  # YOUR CODE GOES HERE
        
        self.fc = nn.Linear(len(kernel_sizes) * out_channels, 1)

        self.norm = nn.LayerNorm(emb_dim)
        
        self.dropout = nn.Dropout(dropout)
        
        
    def forward(self, text):
        embedded = self.embedding(text)

        embedded = self.norm(embedded)
        embedded = self.dropout(embedded)
        
        embedded = embedded.permute(0, 2, 1)  # may be reshape here
        # print(embedded.shape, embedded)
        
        conved_0 = F.relu(self.conv_0(embedded))  # may be reshape here
        conved_1 = F.relu(self.conv_1(embedded))  # may be reshape here
        conved_2 = F.relu(self.conv_2(embedded))  # may be reshape here
        
        pooled_0 = F.max_pool1d(conved_0, conved_0.shape[2]).squeeze(2)
        pooled_1 = F.max_pool1d(conved_1, conved_1.shape[2]).squeeze(2)
        pooled_2 = F.max_pool1d(conved_2, conved_2.shape[2]).squeeze(2)
        
        cat = self.dropout(torch.cat((pooled_0, pooled_1, pooled_2), dim=1))
            
        return self.fc(cat)

kernel_sizes = [2, 4, 5]
vocab_size = len(TEXT_base_cnn.vocab)
out_channels = 64
dropout = 0.4
lr_base_cnn = 2e-3
dim = 300

model_base_cnn = CNN(vocab_size=vocab_size, emb_dim=dim, out_channels=out_channels,
            kernel_sizes=kernel_sizes, dropout=dropout)

model_base_cnn = model_base_cnn.to(device)

opt_base_cnn = torch.optim.AdamW(model_base_cnn.parameters(),lr=lr_base_cnn)
loss_func_base_cnn = nn.BCEWithLogitsLoss()

max_epochs = 20
patience = 2

Обучите!

In [None]:
import numpy as np

min_loss = np.inf

cur_patience = 0

best_model_base_cnn = model_base_cnn

for epoch in range(1, max_epochs + 1):
    train_loss = 0.0
    model_base_cnn.train()
    pbar = tqdm(enumerate(train_iter_cnn), total=len(train_iter_cnn), leave=False)
    pbar.set_description(f"Epoch {epoch}")
    for it, batch in pbar: 
        #YOUR CODE GOES HERE
        opt_base_cnn.zero_grad()
        text = batch.text.to(device)
        target = batch.label.to(device)
        preds = model_base_cnn(text).reshape(-1)
        loss = loss_func_base_cnn(preds, target)
        train_loss += loss.item() 

        loss.backward()
        opt_base_cnn.step()

    train_loss /= len(train_iter_cnn)
    val_loss = 0.0
    with torch.no_grad():
      pbar = tqdm(enumerate(val_iter_cnn), total=len(val_iter_cnn), leave=False)
      pbar.set_description(f"Epoch {epoch}")
      for it, batch in pbar:
          # YOUR CODE GOES HERE
          text = batch.text.to(device)
          target = batch.label.to(device)
          preds = model_base_cnn(text).reshape(-1)
          loss = loss_func_base_cnn(preds, target)
          val_loss += loss.item()

    val_loss /= len(val_iter_cnn)
    if val_loss < min_loss:
        min_loss = val_loss
        best_model_base_cnn = model_base_cnn.state_dict()
    else:
        cur_patience += 1
        if cur_patience == patience:
            cur_patience = 0
            break
    
    print('Epoch: {}, Training Loss: {}, Validation Loss: {}'.format(epoch, train_loss, val_loss))
model_base_cnn.load_state_dict(best_model_base_cnn)

HBox(children=(FloatProgress(value=0.0, max=137.0), HTML(value='')))





HBox(children=(FloatProgress(value=0.0, max=30.0), HTML(value='')))

Epoch: 1, Training Loss: 0.6981223269100607, Validation Loss: 0.6094233453273773


HBox(children=(FloatProgress(value=0.0, max=137.0), HTML(value='')))



HBox(children=(FloatProgress(value=0.0, max=30.0), HTML(value='')))

Epoch: 2, Training Loss: 0.5501720124352587, Validation Loss: 0.5209705750147502


HBox(children=(FloatProgress(value=0.0, max=137.0), HTML(value='')))



HBox(children=(FloatProgress(value=0.0, max=30.0), HTML(value='')))

Epoch: 3, Training Loss: 0.4448002265752667, Validation Loss: 0.4675338516632716


HBox(children=(FloatProgress(value=0.0, max=137.0), HTML(value='')))



HBox(children=(FloatProgress(value=0.0, max=30.0), HTML(value='')))

Epoch: 4, Training Loss: 0.346447604851131, Validation Loss: 0.43834369083245595


HBox(children=(FloatProgress(value=0.0, max=137.0), HTML(value='')))



HBox(children=(FloatProgress(value=0.0, max=30.0), HTML(value='')))

Epoch: 5, Training Loss: 0.27464727709328174, Validation Loss: 0.41762318313121793


HBox(children=(FloatProgress(value=0.0, max=137.0), HTML(value='')))



HBox(children=(FloatProgress(value=0.0, max=30.0), HTML(value='')))

Epoch: 6, Training Loss: 0.2060459777168984, Validation Loss: 0.4363574147224426


HBox(children=(FloatProgress(value=0.0, max=137.0), HTML(value='')))



HBox(children=(FloatProgress(value=0.0, max=30.0), HTML(value='')))



<All keys matched successfully>

Посчитайте f1-score вашего классификатора.

**Ответ**:

In [61]:
def testing_cnn(test_model, test_criterion, test_loader, device="cpu"):
  
  # pbar = tqdm(test_loader, desc=f"Test Loss: {0}, Test Acc: {0}")
  pbar = tqdm(enumerate(test_loader), total=len(test_loader), leave=False)
  mean_loss = 0
  mean_f1 = 0
  N_batches = 0
  test_model.eval()
  with torch.no_grad():
    for it, batch in pbar:
      # YOUR CODE GOES HERE
      text = batch.text
      target = batch.label
      preds = test_model(text).reshape(-1)
      loss = test_criterion(preds, target)
      # loss = loss.item()

      preds = torch.round(torch.sigmoid(preds))
      
      # f1 = f1_score(target.to(device), preds.to(device), average='weighted')
      f1 = accuracy_score(target.to(device), preds.to(device))

      mean_loss += loss.item()
      mean_f1 += f1

      N_batches = it

      pbar.set_description(f"Test Loss: {loss:.4}, Test F1: {f1:.4}")
  N_batches = len(test_loader)
  pbar.set_description(f"Test Loss: {mean_loss / N_batches:.4}, Test F1: {mean_f1 / N_batches:.4}")

  return {"Test Loss": mean_loss / N_batches, "Test F1": mean_f1 / N_batches}

In [62]:
testing_cnn(model_base_cnn, loss_func_base_cnn, test_iter_cnn, device="cpu")

HBox(children=(FloatProgress(value=0.0, max=98.0), HTML(value='')))



{'Test F1': 0.8553206997084548, 'Test Loss': 0.3374035043679938}

## Интерпретируемость

Посмотрим, куда смотрит наша модель. Достаточно запустить код ниже.

[K     |████████████████████████████████| 4.4MB 8.1MB/s 
[?25h

In [None]:


PAD_IND_base_cnn = TEXT_base_cnn.vocab.stoi['pad']

token_reference = TokenReferenceBase(reference_token_idx=PAD_IND_base_cnn)
lig = LayerIntegratedGradients(model_base_cnn, model_base_cnn.embedding)

In [None]:
def forward_with_softmax(model, inp):
    logits = model(inp)
    return torch.softmax(logits, 0)[0][1]

def forward_with_sigmoid(model, input):
    return torch.sigmoid(model(input))


# accumalate couple samples in this array for visualization purposes
vis_data_records_ig = []

def interpret_sentence(model, sentence, min_len = 7, label = 0):
    model.eval()
    text = [tok for tok in TEXT_base_cnn.tokenize(sentence)]
    if len(text) < min_len:
        text += ['pad'] * (min_len - len(text))
    indexed = [TEXT_base_cnn.vocab.stoi[t] for t in text]

    model.zero_grad()

    input_indices = torch.tensor(indexed, device=device)
    input_indices = input_indices.unsqueeze(0)
    
    # input_indices dim: [sequence_length]
    seq_length = min_len

    # predict
    pred = forward_with_sigmoid(model, input_indices).item()
    pred_ind = round(pred)

    # generate reference indices for each sample
    reference_indices = token_reference.generate_reference(seq_length, device=device).unsqueeze(0)

    # compute attributions and approximation delta using layer integrated gradients
    attributions_ig, delta = lig.attribute(input_indices, reference_indices, \
                                           n_steps=5000, return_convergence_delta=True)

    print('pred: ', LABEL_base_cnn.vocab.itos[pred_ind], '(', '%.2f'%pred, ')', ', delta: ', abs(delta))

    add_attributions_to_visualizer(attributions_ig, text, pred, pred_ind, label, delta, vis_data_records_ig)
    
def add_attributions_to_visualizer(attributions, text, pred, pred_ind, label, delta, vis_data_records):
    attributions = attributions.sum(dim=2).squeeze(0)
    attributions = attributions / torch.norm(attributions)
    attributions = attributions.cpu().detach().numpy()

    # storing couple samples in an array for visualization purposes
    vis_data_records.append(visualization.VisualizationDataRecord(
                            attributions,
                            pred,
                            LABEL_base_cnn.vocab.itos[pred_ind],
                            LABEL_base_cnn.vocab.itos[label],
                            LABEL_base_cnn.vocab.itos[1],
                            attributions.sum(),       
                            text,
                            delta))

In [None]:
interpret_sentence(model_base_cnn, 'It was a fantastic performance !', label=1)
interpret_sentence(model_base_cnn, 'Best film ever', label=1)
interpret_sentence(model_base_cnn, 'Such a great show!', label=1)
interpret_sentence(model_base_cnn, 'It was a horrible movie', label=0)
interpret_sentence(model_base_cnn, 'I\'ve never watched something as bad', label=0)
interpret_sentence(model_base_cnn, 'It is a disgusting movie!', label=0)

pred:  pos ( 0.82 ) , delta:  tensor([1.4310e-06], device='cuda:0', dtype=torch.float64)
pred:  neg ( 0.50 ) , delta:  tensor([3.5274e-05], device='cuda:0', dtype=torch.float64)
pred:  pos ( 0.76 ) , delta:  tensor([9.3281e-05], device='cuda:0', dtype=torch.float64)
pred:  neg ( 0.32 ) , delta:  tensor([4.5477e-05], device='cuda:0', dtype=torch.float64)
pred:  neg ( 0.23 ) , delta:  tensor([0.0002], device='cuda:0', dtype=torch.float64)
pred:  pos ( 0.55 ) , delta:  tensor([0.0002], device='cuda:0', dtype=torch.float64)


Попробуйте добавить свои примеры!

In [None]:
print('Visualize attributions based on Integrated Gradients')
visualization.visualize_text(vis_data_records_ig)
None

Visualize attributions based on Integrated Gradients


True Label,Predicted Label,Attribution Label,Attribution Score,Word Importance
pos,pos (0.82),pos,1.84,It was a fantastic performance ! pad
,,,,
pos,neg (0.50),pos,0.36,Best film ever pad pad pad pad
,,,,
pos,pos (0.76),pos,1.0,Such a great show! pad pad pad
,,,,
neg,neg (0.32),pos,-1.18,It was a horrible movie pad pad
,,,,
neg,neg (0.23),pos,-0.93,I've never watched something as bad pad
,,,,


## Эмбэдинги слов

Вы ведь не забыли, как мы можем применить знания о word2vec и GloVe. Давайте попробуем!

In [None]:
TEXT_base_cnn.build_vocab(trn_base_cnn, vectors='glove.6B.300d')# YOUR CODE GOES HERE
# подсказка: один из импортов пока не использовался, быть может он нужен в строке выше :)
LABEL_base_cnn.build_vocab(trn_base_cnn)

In [None]:

word_embeddings = TEXT_base_cnn.vocab.vectors

kernel_sizes = [3, 4, 5]
vocab_size = len(TEXT_base_cnn.vocab)
dropout = 0.4
lr_base_cnn = 2e-3
dim = 300


train_emb, tst_emb = IMDB.splits(TEXT_base_cnn, LABEL_base_cnn)
trn_emb, vld_emb = train_emb.split(random_state=random.seed(SEED))

device = "cuda" if torch.cuda.is_available() else "cpu"

train_iter_emb, val_iter_emb, test_iter_emb = BucketIterator.splits(
        (trn_emb, vld_emb, tst_emb),
        batch_sizes=(128, 256, 256),
        sort=False,
        sort_key= lambda x: len(x.src),
        sort_within_batch=False,
        device=device,
        repeat=False,
)

model_emb = CNN(vocab_size=vocab_size, emb_dim=dim, out_channels=64,
            kernel_sizes=kernel_sizes, dropout=dropout)

word_embeddings = TEXT_base_cnn.vocab.vectors

prev_shape = model_emb.embedding.weight.shape

# model.embedding.weight = nn.parameter.Parameter(word_embeddings, requires_grad=True)# инициализируйте эмбэдинги
model_emb.embedding = nn.Embedding.from_pretrained(word_embeddings)

assert prev_shape == model_emb.embedding.weight.shape
model_emb.to(device)

opt_emb = torch.optim.AdamW(model_emb.parameters(), lr=lr_base_cnn)



Вы знаете, что делать.

In [None]:
def freeze_embeddings(model, req_grad=False):
    embeddings = model.embedding
    for c_p in embeddings.parameters():
        c_p.requires_grad = req_grad

In [None]:
import numpy as np

min_loss = np.inf

cur_patience = 0
max_grad_norm = 3
num_iter = 0
num_freeze_iter = 400
max_epochs = 20
flag = True
patience = 3

best_model_emb = model_emb.state_dict()

freeze_embeddings(model_emb, False)

for epoch in range(1, max_epochs + 1):
    train_loss = 0.0
    model_emb = model_emb.train()
    pbar = tqdm(enumerate(train_iter_emb), total=len(train_iter_emb), leave=False, )
    pbar.set_description(f"Epoch {epoch}")
    for it, batch in pbar: 
        #YOUR CODE GOES HERE
        if num_iter > num_freeze_iter and flag:
          freeze_embeddings(model_emb, True)
          flag = False
        opt_emb.zero_grad()
        #YOUR CODE GOES HERE
        text = batch.text.to(device)
        target = batch.label.to(device)
        preds = model_emb(text).reshape(-1)
        loss = loss_func_base_cnn(preds, target)
        train_loss += loss.item() 
        loss.backward()
        num_iter += 1

        if max_grad_norm is not None:
          torch.nn.utils.clip_grad_norm_(model_emb.parameters(), max_grad_norm)

        opt_emb.step()

    train_loss /= len(train_iter_emb)
    val_loss = 0.0
    model_emb.eval()
    pbar = tqdm(enumerate(val_iter_emb), total=len(val_iter_emb), leave=False)
    pbar.set_description(f"Epoch {epoch}")
    for it, batch in pbar:
        # YOUR CODE GOES HERE
        text = batch.text.to(device)
        target = batch.label.to(device)
        preds = model_emb(text).reshape(-1)
        loss = loss_func_base_cnn(preds, target)
        val_loss += loss.item()

    val_loss /= len(val_iter_emb)
    if val_loss < min_loss:
        min_loss = val_loss
        best_model_emb = model_emb.state_dict()
    else:
        cur_patience += 1
        if cur_patience == patience:
            cur_patience = 0
            break
    
    print('Epoch: {}, Training Loss: {}, Validation Loss: {}'.format(epoch, train_loss, val_loss))
model_emb.load_state_dict(best_model_emb)

HBox(children=(FloatProgress(value=0.0, max=137.0), HTML(value='')))





HBox(children=(FloatProgress(value=0.0, max=30.0), HTML(value='')))

Epoch: 1, Training Loss: 0.556328232488493, Validation Loss: 0.38490845263004303


HBox(children=(FloatProgress(value=0.0, max=137.0), HTML(value='')))



HBox(children=(FloatProgress(value=0.0, max=30.0), HTML(value='')))

Epoch: 2, Training Loss: 0.4137953332740895, Validation Loss: 0.36386048793792725


HBox(children=(FloatProgress(value=0.0, max=137.0), HTML(value='')))



HBox(children=(FloatProgress(value=0.0, max=30.0), HTML(value='')))

Epoch: 3, Training Loss: 0.4049737660119133, Validation Loss: 0.41368531982103984


HBox(children=(FloatProgress(value=0.0, max=137.0), HTML(value='')))



HBox(children=(FloatProgress(value=0.0, max=30.0), HTML(value='')))

Epoch: 4, Training Loss: 0.33234186579276176, Validation Loss: 0.3024956891934077


HBox(children=(FloatProgress(value=0.0, max=137.0), HTML(value='')))



HBox(children=(FloatProgress(value=0.0, max=30.0), HTML(value='')))

Epoch: 5, Training Loss: 0.054877064795824736, Validation Loss: 0.35640181253353753


HBox(children=(FloatProgress(value=0.0, max=137.0), HTML(value='')))



HBox(children=(FloatProgress(value=0.0, max=30.0), HTML(value='')))



<All keys matched successfully>

Посчитайте f1-score вашего классификатора.

**Ответ**:

In [None]:
testing_cnn(model_emb, loss_func_base_cnn, test_iter_emb, device="cpu")

HBox(children=(FloatProgress(value=0.0, max=98.0), HTML(value='')))





{'Test F1': 0.46052855859806485, 'Test Loss': 0.4082462769381854}

Проверим насколько все хорошо!

In [None]:
PAD_IND = TEXT_base_cnn.vocab.stoi['pad']

token_reference = TokenReferenceBase(reference_token_idx=PAD_IND)
lig = LayerIntegratedGradients(model_emb, model_emb.embedding)
vis_data_records_ig = []

interpret_sentence(model_emb, 'It was a fantastic performance !', label=1)
interpret_sentence(model_emb, 'Best film ever', label=1)
interpret_sentence(model_emb, 'Such a great show!', label=1)
interpret_sentence(model_emb, 'It was a horrible movie', label=0)
interpret_sentence(model_emb, 'I\'ve never watched something as bad', label=0)
interpret_sentence(model_emb, 'It is a disgusting movie!', label=0)
interpret_sentence(model_emb, 'Is it a Russian movie?', label=0)


pred:  pos ( 0.93 ) , delta:  tensor([5.3762e-05], device='cuda:0', dtype=torch.float64)
pred:  neg ( 0.20 ) , delta:  tensor([2.2721e-05], device='cuda:0', dtype=torch.float64)
pred:  pos ( 0.81 ) , delta:  tensor([0.0003], device='cuda:0', dtype=torch.float64)
pred:  neg ( 0.00 ) , delta:  tensor([0.0004], device='cuda:0', dtype=torch.float64)
pred:  neg ( 0.32 ) , delta:  tensor([0.0003], device='cuda:0', dtype=torch.float64)
pred:  neg ( 0.00 ) , delta:  tensor([0.0001], device='cuda:0', dtype=torch.float64)
pred:  neg ( 0.00 ) , delta:  tensor([0.0002], device='cuda:0', dtype=torch.float64)


In [None]:
print('Visualize attributions based on Integrated Gradients')
visualization.visualize_text(vis_data_records_ig)
None

Visualize attributions based on Integrated Gradients


True Label,Predicted Label,Attribution Label,Attribution Score,Word Importance
pos,pos (0.93),pos,2.04,It was a fantastic performance ! pad
,,,,
pos,neg (0.20),pos,1.35,Best film ever pad pad pad pad
,,,,
pos,pos (0.81),pos,1.79,Such a great show! pad pad pad
,,,,
neg,neg (0.00),pos,-0.88,It was a horrible movie pad pad
,,,,
neg,neg (0.32),pos,1.41,I've never watched something as bad pad
,,,,
