<a href="https://colab.research.google.com/github/Uniholder/DeepLearningSchool/blob/main/2_semester/3_RNN/%5Bhomework%5Dclassification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img src="https://s8.hostingkartinok.com/uploads/images/2018/08/308b49fcfbc619d629fe4604bceb67ac.jpg" width=500, height=450>
<h3 style="text-align: center;"><b>Phystech School of Applied Mathematics and Informatics (FPMI), MIPT</b></h3>

---

# Assignment 3

## Text Classification

In this assignment you will try several methods used for classification, and also examine how well the model captures word meaning and which words in an example influence the result.

In [1]:
!nvidia-smi

Mon Oct 25 00:05:11 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.74       Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   35C    P8    29W / 149W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               

In [2]:
import pandas as pd
import numpy as np
import torch

from torchtext.legacy import datasets
from torchtext.legacy.data import Field, LabelField, BucketIterator, dataset
from torchtext.vocab import Vectors, GloVe

import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import random
from tqdm.autonotebook import tqdm

SEED = 1234

torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

from sklearn.metrics import confusion_matrix

def precision(tp, fp):
    pp = tp + fp
    return (tp / pp) if pp != 0 else 0
def recall(tp, fn):
    ap = tp + fn  # actual positives; guard against division by zero
    return (tp / ap) if ap != 0 else 0
def f1(precision, recall):
    return 2 * (precision * recall) / (precision + recall) \
                                        if precision + recall != 0 else 0
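
The helpers above work on counts accumulated over batches. As a sanity check, here is a minimal, dependency-free sketch of that accumulation on made-up batches (the batches and the `batch_counts` helper are hypothetical, not part of the assignment). Note the second batch: it contains only correctly classified negatives and must contribute nothing to `tp`.

```python
def precision(tp, fp):
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    return tp / (tp + fn) if tp + fn else 0.0

def f1(p, r):
    return 2 * p * r / (p + r) if p + r else 0.0

def batch_counts(y_true, y_pred):
    """Count (tp, fp, fn) for one batch of 0/1 labels and 0/1 predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, fn

# Hypothetical toy batches: (true labels, predicted labels)
batches = [
    ([1, 1, 0, 0], [1, 0, 1, 0]),
    ([0, 0, 0, 0], [0, 0, 0, 0]),  # single-class batch, all true negatives
    ([1, 1, 1, 0], [1, 1, 0, 0]),
]

tp = fp = fn = 0
for y_true, y_pred in batches:
    b_tp, b_fp, b_fn = batch_counts(y_true, y_pred)
    tp, fp, fn = tp + b_tp, fp + b_fp, fn + b_fn

score = f1(precision(tp, fp), recall(tp, fn))
```

Here the totals are `tp=3, fp=1, fn=2`, so precision is 0.75, recall is 0.6, and F1 is 2/3.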

In this assignment we will use the torchtext library. It is fairly easy to use and lets us concentrate on the task rather than on writing a DataLoader.

In [3]:
TEXT = Field(sequential=True, lower=True, include_lengths=True)  # text field
LABEL = LabelField(dtype=torch.float)  # label field

The dataset for our experiments is a collection of movie reviews from the IMDB website.

In [4]:
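train, test = datasets.IMDB.splits(TEXT, LABEL)  # load the dataset
# random.seed() returns None; it is called here for its side effect of seeding the global RNG
train, valid = train.split(random_state=random.seed(SEED))  # split into train and validation parts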

In [5]:
TEXT.build_vocab(train)
LABEL.build_vocab(train)

In [6]:
device = "cuda" if torch.cuda.is_available() else "cpu"

train_iter, valid_iter, test_iter = BucketIterator.splits(
    (train, valid, test), 
    batch_size = 16,
    sort_within_batch = True,
    device = device)
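
Why bucket by length at all? Batches are padded to the longest sequence they contain, so grouping reviews of similar length wastes fewer pad tokens. The sketch below illustrates the effect on made-up lengths; `padding_cost` is a hypothetical helper, and the real `BucketIterator` is more involved (it also shuffles), so treat this as an illustration only.

```python
def padding_cost(lengths, batch_size):
    """Total pad tokens when consecutive items are batched and padded to the batch maximum."""
    total = 0
    for i in range(0, len(lengths), batch_size):
        chunk = lengths[i:i + batch_size]
        total += max(chunk) * len(chunk) - sum(chunk)
    return total

lengths = [5, 120, 7, 300, 9, 110, 6, 280]  # hypothetical review lengths in tokens

unsorted_pad = padding_cost(lengths, batch_size=2)          # batches in arrival order
bucketed_pad = padding_cost(sorted(lengths), batch_size=2)  # similar lengths batched together
```

On these numbers the arrival-order batches need 783 pad tokens, while the length-sorted batches need 33.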

## RNN

To start, let's try recurrent neural networks. You met GRU in the seminar; you can also try LSTM. For classification you can use either the hidden_state or the output of the last token.

In [7]:
class MyRNN(nn.Module):
    def __init__(self, embed_size, hidden_size):
        super().__init__()

        self.embed_size = embed_size
        self.hidden_size = hidden_size

        self.w_h = nn.Parameter(torch.rand(hidden_size, hidden_size))
        self.b_h = nn.Parameter(torch.rand((1, hidden_size)))
        self.w_x = nn.Parameter(torch.rand(embed_size, hidden_size))
        self.b_x = nn.Parameter(torch.rand(1, hidden_size))
        self.w_yh = nn.Parameter(torch.rand(hidden_size, hidden_size))
        self.b_yh = nn.Parameter(torch.rand(1, hidden_size))

    def forward(self, x, hidden=None):
        '''
        x – torch.FloatTensor with the shape (seq_length, bs, emb_size)
        hidden - torch.FloatTensor with the shape (bs, hidden_size)
        return: torch.FloatTensor with the shape (bs, hidden_size)
        '''
        if hidden is None:
            hidden = torch.zeros((x.size(1), self.hidden_size)).to(x.device)
        seq_length = x.size(0)
        for cur_idx in range(seq_length):
            hidden = torch.tanh(
                x[cur_idx] @ self.w_x + self.b_x + hidden @ self.w_h + self.b_h
            )
        y = torch.tanh(
            hidden @ self.w_yh + self.b_yh
        )
        return y
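
`MyRNN` draws its weights with `torch.rand`, i.e. uniformly from [0, 1). A NumPy sketch of the same recurrence suggests that this may push `tanh` into saturation, which could be one contributor to poor training (it is not the whole story, since the results table reports that `nn.RNN` with its default initialization did not learn either). The alternative range used for comparison, uniform on [-1/sqrt(hidden), 1/sqrt(hidden)], is the documented default of `nn.RNN`; the single combined bias and the toy input are simplifying assumptions.

```python
import numpy as np

def final_hidden(x, w_x, w_h, b):
    """Run h_t = tanh(x_t @ w_x + h_{t-1} @ w_h + b); x has shape (seq_len, batch, emb)."""
    h = np.zeros((x.shape[1], w_h.shape[0]))
    for x_t in x:
        h = np.tanh(x_t @ w_x + h @ w_h + b)
    return h

rng = np.random.default_rng(0)
emb, hid = 100, 256
x = rng.normal(size=(20, 4, emb))  # toy input: 20 steps, batch of 4

# Initialization as in MyRNN: uniform on [0, 1)
h_rand = final_hidden(x, rng.random((emb, hid)), rng.random((hid, hid)), rng.random((1, hid)))

# Comparison: uniform on [-k, k] with k = 1/sqrt(hid)
k = 1 / np.sqrt(hid)
h_small = final_hidden(
    x,
    rng.uniform(-k, k, (emb, hid)),
    rng.uniform(-k, k, (hid, hid)),
    rng.uniform(-k, k, (1, hid)),
)

# Fraction of hidden units whose activation is pinned near +1 or -1
saturated_rand = np.mean(np.abs(h_rand) > 0.99)
saturated_small = np.mean(np.abs(h_small) > 0.99)
```

When almost every unit is saturated, the derivative of `tanh` is close to zero and very little gradient flows back through time.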

In [8]:
class RNNModel(nn.Module):
    def __init__(
            self, 
            vocab_size, 
            embedding_dim, 
            hidden_dim, 
            output_dim, 
            n_layers, 
            bidirectional, 
            dropout, 
            pad_idx
        ):
        super().__init__()
        self.bidirectional = bidirectional
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=pad_idx)
        # self.rnn = MyRNN(embed_size=embedding_dim, hidden_size=hidden_dim)
        # self.rnn = nn.RNN(
        #     input_size=embedding_dim,
        #     hidden_size=hidden_dim,
        # )
        self.rnn = nn.GRU(
            input_size=embedding_dim,
            hidden_size=hidden_dim,
            num_layers=n_layers,
            bidirectional=bidirectional,
            dropout=dropout
        )
        # self.rnn = nn.LSTM(
        #     input_size=embedding_dim,
        #     hidden_size=hidden_dim,
        #     num_layers=n_layers,
        #     bidirectional=bidirectional,
        #     dropout=dropout
        # )
        n_directions = 2 if bidirectional else 1
        self.fc = nn.Linear(hidden_dim * n_directions, output_dim)
        self.dropout = nn.Dropout(dropout)
        
        
    def forward(self, text, text_lengths):
        '''
        text: [sent len, batch size]
        embedded: [sent len, batch size, emb dim]
        hidden: [num layers * num directions, batch size, hid dim]
        cell: [num layers * num directions, batch size, hid dim]
        output: [sent len, batch size, hid dim * num directions]
        hidden: [batch size, hid dim * num directions]
        '''
        embedded = self.embedding(text)
        
        # pack sequence
        packed_embedded = nn.utils.rnn.pack_padded_sequence(embedded, text_lengths.cpu())
        
        # cell arg for LSTM, remove for GRU
        packed_output, hidden = self.rnn(packed_embedded)  # packed_output, (hidden, cell)

        # unpack sequence, output over padding tokens are zero tensors
        # output, output_lengths = nn.utils.rnn.pad_packed_sequence(packed_output)
        
        #concat the final forward (hidden[-2,:,:]) and backward (hidden[-1,:,:]) hidden layers
        #and apply dropout
        if self.bidirectional:
            hidden = self.dropout(torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim=1))
        else:
            hidden = hidden[-1]

        fc = self.fc(hidden)
            
        return torch.sigmoid(fc)
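
The model packs the batch before the recurrent layer so that the final hidden state corresponds to the last real token of each review. The NumPy sketch below shows what goes wrong otherwise on a toy recurrence: even when the padding embedding is a zero vector (as with `padding_idx`), the state keeps changing during padding steps. The weights and sizes are hypothetical, and this plain `tanh` recurrence only stands in for a GRU.

```python
import numpy as np

rng = np.random.default_rng(0)
emb, hid = 3, 4
w_x = rng.uniform(-0.5, 0.5, (emb, hid))
w_h = rng.uniform(-0.5, 0.5, (hid, hid))

def run(x):
    """Return the hidden state after the last row of x, for h_t = tanh(x_t @ w_x + h_{t-1} @ w_h)."""
    h = np.zeros(hid)
    for x_t in x:
        h = np.tanh(x_t @ w_x + h @ w_h)
    return h

real = rng.normal(size=(5, emb))   # 5 real tokens
pad = np.zeros((3, emb))           # 3 padding steps with a zero embedding
padded = np.concatenate([real, pad])

h_last_real = run(real)    # state at the last real token (what packing gives)
h_after_pad = run(padded)  # state at the last time step of the padded sequence
```

The two states differ, so classifying from the last time step of an unpacked, padded batch would read a state that has drifted away from the end of the actual text.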

Experiment with the hyperparameters.

In [9]:
model = RNNModel(
    vocab_size=len(TEXT.vocab),
    embedding_dim=100,
    hidden_dim=256,
    output_dim=1,
    n_layers=1,
    bidirectional=False,
    dropout=0.2,
    pad_idx=TEXT.vocab.stoi[TEXT.pad_token]
).to(device)

optimizer = torch.optim.Adam(model.parameters())
loss_func = nn.BCELoss()

max_epochs = 20
patience=3

Train the network! Use whatever tools you find convenient: Catalyst, PyTorch Lightning, or your own hand-rolled code.
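
If you go the hand-rolled route, early stopping is a natural piece to factor out. Below is a minimal sketch of a helper that stops after `patience` consecutive epochs without improvement; the `EarlyStopping` class and the score values are hypothetical illustrations, not part of the assignment code.

```python
class EarlyStopping:
    """Stop when the monitored score has not improved for `patience` consecutive epochs."""

    def __init__(self, patience=3):
        self.patience = patience
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, score):
        """Record one epoch's score; return True if training should stop."""
        if score > self.best:
            self.best = score
            self.bad_epochs = 0  # reset the counter on every improvement
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Hypothetical validation F1 values, one per epoch
scores = [0.77, 0.87, 0.86, 0.87, 0.88, 0.87, 0.86, 0.85, 0.84]

stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, score in enumerate(scores, start=1):
    if stopper.step(score):
        stopped_at = epoch
        break
```

With these scores the best value (0.88) is reached at epoch 5 and training stops at epoch 8, after three consecutive epochs without improvement.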

In [10]:
import numpy as np
from copy import deepcopy

min_loss = np.inf
max_f1 = 0
cur_patience = 0
THRSH = 0.5

for epoch in range(1, max_epochs + 1):
    train_loss = 0.0
    model.train()
    pbar = tqdm(enumerate(train_iter), total=len(train_iter), leave=False)
    pbar.set_description(f"Epoch {epoch}")
    for it, batch in pbar:
        (texts, text_lengths), labels = batch
        optimizer.zero_grad()
        prediction = model(texts, text_lengths).squeeze()
        loss = loss_func(prediction, labels)
        loss.backward()
        train_loss += loss.item()  # accumulate a Python float so the autograd graph is not kept alive
        optimizer.step()
        # break
    train_loss /= len(train_iter)
    val_loss = 0.0
    fp, fn, tp = 0, 0, 0
    model.eval()
    pbar = tqdm(enumerate(valid_iter), total=len(valid_iter), leave=False)
    pbar.set_description(f"Epoch {epoch}")
    with torch.no_grad():
        for it, batch in pbar:
            (texts, text_lengths), labels = batch
            prediction = model(texts, text_lengths).squeeze()
            val_loss += loss_func(prediction, labels)
            # Cast to int arrays and pass labels=[0, 1] so the matrix is always 2x2,
            # even when a batch contains only one class
            y_true = labels.cpu().numpy().astype(int)
            y_pred = (prediction.cpu().numpy() > THRSH).astype(int)
            _, fp_batch, fn_batch, tp_batch = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
            fp += fp_batch
            fn += fn_batch
            tp += tp_batch
    val_loss /= len(valid_iter)
    val_f1 = f1(precision(tp, fp), recall(tp, fn))
    if val_loss < min_loss:
        min_loss = val_loss
        best_loss_model = deepcopy(model.state_dict())
    if val_f1 > max_f1:
        max_f1 = val_f1
        best_f1_model = deepcopy(model.state_dict())
    else:
        cur_patience += 1
        if cur_patience == patience:
            cur_patience = 0
            break
    print(f'Epoch: {epoch}, Training Loss: {train_loss}, Validation Loss: {val_loss}, Validation F1: {val_f1}')
    # break

Epoch: 1, Training Loss: 0.6184461116790771, Validation Loss: 0.45135852694511414, Validation F1: 0.7742974238875878
Epoch: 2, Training Loss: 0.29849526286125183, Validation Loss: 0.3417336046695709, Validation F1: 0.8685125254704542
Epoch: 3, Training Loss: 0.11641838401556015, Validation Loss: 0.37397289276123047, Validation F1: 0.8550884955752213
Epoch: 4, Training Loss: 0.029887281358242035, Validation Loss: 0.5145747661590576, Validation F1: 0.8706748466257669
Epoch: 5, Training Loss: 0.011460289359092712, Validation Loss: 0.6096621751785278, Validation F1: 0.8769309989701338
Epoch: 6, Training Loss: 0.00270083243958652, Validation Loss: 0.6903601288795471, Validation F1: 0.8693840343794764

In [11]:
min_loss, max_f1

(tensor(0.3417, device='cuda:0'), 0.8769309989701338)

In [12]:
model.load_state_dict(best_f1_model)
# model.load_state_dict(best_loss_model)

<All keys matched successfully>

In [13]:
test_loss = 0
fp, fn, tp = 0, 0, 0

pbar = tqdm(enumerate(test_iter), total=len(test_iter), leave=False)
model.eval()
with torch.no_grad():
    for it, batch in pbar:
        (texts, text_lengths), labels = batch
        prediction = model(texts, text_lengths).squeeze()
        test_loss += loss_func(prediction, labels)
        # Cast to int arrays and pass labels=[0, 1] so the matrix is always 2x2,
        # even when a batch contains only one class
        y_true = labels.cpu().numpy().astype(int)
        y_pred = (prediction.cpu().numpy() > THRSH).astype(int)
        _, fp_batch, fn_batch, tp_batch = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
        fp += fp_batch
        fn += fn_batch
        tp += tp_batch
test_loss /= len(test_iter)
test_f1 = f1(precision(tp, fp), recall(tp, fn))
print(f'Test Loss: {test_loss}, Test F1: {test_f1}')

Test Loss: 0.6557254195213318, Test F1: 0.8647677308354044


Compute the F1 score of your classifier on the test set.

**Answer**:

Model|Training Loss|Best Validation Loss|Best Validation F1|Test Loss (best F1)|Test Loss (best loss)|Test F1 (best F1)|Test F1 (best loss)|Notes
-|-|-|-|-|-|-|-|-
MyRNN|0.69, not decreasing|0.69|0.57|0.69|0.69|0.58|0.38|The network does not learn
RNN|0.69, not decreasing|0.67|0.56|0.69|0.67|0.55|0.54|The network does not learn; computation is faster
GRU|0.67-0.02|0.38|0.85|0.57|0.40|0.84|0.83|The network learns
GRU+packed|0.66-0.005|0.33|0.86|0.64|0.34|0.85|0.84|The result improved; packing is used from here on
LSTM|0.67-0.02|0.46|0.83|0.6|0.48|0.82|0.80|More epochs, worse result; GRU is used from here on
embedding_dim=300|0.65-0.005|0.36|0.86|0.68|0.37|0.85|0.84|No gain; keeping 100
hidden_dim=400|0.67-0.05|0.32|0.869|0.34|0.34|0.85|0.85|Negligible gain; keeping 256 (512 could not be tested because of memory limits)
2 layers|0.63-0.01|0.34|0.867|0.68|0.35|0.857|0.849|F1 increased only slightly, but validation F1 stays consistently above 0.86; batch size reduced to 32
2 layers+dropout 0.2|0.6-0.008|0.35|0.87|0.71|0.37|0.86|0.85|Quality improved; batch size = 16; however, the increase in training time is out of proportion to the quality gain
2 layers+dropout 0.2+bidirectional|0.57-0.02|0.33|0.87|0.49|0.34|0.86|0.85|Quality did not improve much
LSTM|0.62-0.01|0.40|0.85|0.57|0.42|0.84|0.81|LSTM again performed worse than GRU

Increasing the number of layers improves quality, but it requires a smaller batch size, which noticeably slows down training.

## CNN

![](https://www.researchgate.net/publication/333752473/figure/fig1/AS:769346934673412@1560438011375/Standard-CNN-on-text-classification.png)

Convolutional neural networks are also often used for text classification. The idea is that sentiment is usually carried by phrases of two or three words, such as "very good movie" or "incredibly boring". Sliding a convolution over these words produces a high score, which MaxPool then picks out. A regular fully connected network follows. An important point: the convolutions are applied in parallel, not sequentially. Let's try it!
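
The idea can be sketched in plain NumPy before touching PyTorch: several kernels of different widths slide over the same embedded sentence independently, each output channel keeps only its maximum over positions, and the results are concatenated. Sizes and random values below are hypothetical, and the sketch omits padding, stride, and bias for brevity.

```python
import numpy as np

def conv_relu_maxpool(x, kernel):
    """x: (seq_len, emb_dim); kernel: (out_channels, width, emb_dim).
    Returns max over positions of ReLU(conv), shape (out_channels,)."""
    width = kernel.shape[1]
    # All n-gram windows of the sentence: (n_windows, width, emb_dim)
    windows = np.stack([x[i:i + width] for i in range(x.shape[0] - width + 1)])
    # One score per window and output channel: (n_windows, out_channels)
    scores = np.einsum("nwe,cwe->nc", windows, kernel)
    return np.maximum(scores, 0).max(axis=0)

rng = np.random.default_rng(0)
seq_len, emb_dim, out_channels = 10, 8, 4
x = rng.normal(size=(seq_len, emb_dim))  # one embedded sentence

# Three kernels (bigrams, trigrams, 4-grams) applied in parallel to the same input
kernels = [rng.normal(size=(out_channels, width, emb_dim)) for width in (2, 3, 4)]
features = np.concatenate([conv_relu_maxpool(x, k) for k in kernels])
```

`features` has `3 * out_channels = 12` entries regardless of sentence length, which is what lets a fixed-size linear layer follow.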

In [1]:
import pandas as pd
import numpy as np
import torch

from torchtext.legacy import datasets
from torchtext.legacy.data import Field, LabelField, BucketIterator, dataset
from torchtext.vocab import Vectors, GloVe

import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import random
from tqdm.autonotebook import tqdm

SEED = 1234

torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

from sklearn.metrics import confusion_matrix

def precision(tp, fp):
    pp = tp + fp
    return (tp / pp) if pp != 0 else 0
def recall(tp, fn):
    ap = tp + fn  # actual positives; guard against division by zero
    return (tp / ap) if ap != 0 else 0
def f1(precision, recall):
    return 2 * (precision * recall) / (precision + recall) \
                                        if precision + recall != 0 else 0

In [2]:
TEXT = Field(sequential=True, lower=True, batch_first=True)  # batch_first because we use conv layers
LABEL = LabelField(batch_first=True, dtype=torch.float)

train, tst = datasets.IMDB.splits(TEXT, LABEL)
trn, vld = train.split(random_state=random.seed(SEED))

TEXT.build_vocab(trn)
LABEL.build_vocab(trn)

device = "cuda" if torch.cuda.is_available() else "cpu"

In [3]:
train_iter, valid_iter, test_iter = BucketIterator.splits(
        (trn, vld, tst),
        batch_sizes=(128, 256, 256),
        sort=False,
        sort_key=lambda x: len(x.text),  # IMDB examples expose .text, not .src
        sort_within_batch=False,
        device=device,
        repeat=False,
)

You can use Conv2d with `in_channels=1, kernel_size=(kernel_sizes[0], emb_dim)` or Conv1d with `in_channels=emb_dim, kernel_size=kernel_sizes[0]`. Either way, think carefully about the tensor shapes.
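
For the Conv1d variant, the input must be laid out as (batch, emb_dim, seq_len), and the output length follows the formula from the PyTorch documentation. The small helper below (a hypothetical name, `conv1d_out_len`) evaluates that formula so you can check shapes on paper before running the model.

```python
def conv1d_out_len(length, kernel_size, padding=0, stride=1, dilation=1):
    """Output length of a 1D convolution, per the formula in the PyTorch Conv1d docs."""
    return (length + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1

seq_len = 7
# Settings matching the three branches used in this notebook's CNN: (kernel_size, padding, stride)
branches = [(2, 1, 2), (3, 1, 1), (4, 1, 1)]
out_lens = [conv1d_out_len(seq_len, k, padding=p, stride=s) for k, p, s in branches]
```

For a 7-token input the three branches produce lengths 4, 7, and 6; max pooling over the whole length then reduces each branch to one value per channel.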

In [4]:
class CNN(nn.Module):
    def __init__(
        self,
        vocab_size,
        emb_dim,
        out_channels,
        kernel_sizes,
        dropout=0.5,
    ):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.conv_0 = nn.Sequential(
            nn.Conv1d(in_channels=emb_dim, out_channels=out_channels, kernel_size=kernel_sizes[0], padding=1, stride=2),
            # nn.BatchNorm1d(out_channels),
        )
        self.conv_1 = nn.Sequential(
            nn.Conv1d(in_channels=emb_dim, out_channels=out_channels, kernel_size=kernel_sizes[1], padding=1, stride=1),
            # nn.BatchNorm1d(out_channels),
        )
        self.conv_2 = nn.Sequential(
            nn.Conv1d(in_channels=emb_dim, out_channels=out_channels, kernel_size=kernel_sizes[2], padding=1, stride=1),
            # nn.BatchNorm1d(out_channels),
        )
        self.fc = nn.Linear(len(kernel_sizes) * out_channels, 1)
        self.dropout = nn.Dropout(dropout)
        
        
    def forward(self, text):
        
        embedded = self.embedding(text)
        embedded = embedded.movedim(2, 1)
        
        conved_0 = F.relu(self.conv_0(embedded))
        conved_1 = F.relu(self.conv_1(embedded))
        conved_2 = F.relu(self.conv_2(embedded))

        # print('conved_0:', conved_0.shape)
        # print('conved_1:', conved_1.shape)
        # print('conved_2:', conved_2.shape)
        
        pooled_0 = F.max_pool1d(conved_0, conved_0.shape[2]).squeeze(2)
        pooled_1 = F.max_pool1d(conved_1, conved_1.shape[2]).squeeze(2)
        pooled_2 = F.max_pool1d(conved_2, conved_2.shape[2]).squeeze(2)

        # print('pooled_0:', pooled_0.shape)
        # print('pooled_1:', pooled_1.shape)
        # print('pooled_2:', pooled_2.shape)
        
        cat = self.dropout(torch.cat((pooled_0, pooled_1, pooled_2), dim=1))

        # print('cat:', cat.shape)
        
        fc = self.fc(cat)
            
        return torch.sigmoid(fc)

In [5]:
max_epochs = 30
patience = 3

model = CNN(
    vocab_size=len(TEXT.vocab), 
    emb_dim=300, 
    out_channels=128,
    kernel_sizes=[2, 3, 4], 
    dropout=0.5
).to(device)

optimizer = torch.optim.Adam(model.parameters())
loss_func = nn.BCELoss()

Train it!

In [6]:
import numpy as np
from copy import deepcopy

min_loss = np.inf
max_f1 = 0
cur_patience = 0
THRSH = 0.5

for epoch in range(1, max_epochs + 1):
    train_loss = 0.0
    model.train()
    pbar = tqdm(enumerate(train_iter), total=len(train_iter), leave=False)
    pbar.set_description(f"Epoch {epoch}")
    for it, batch in pbar:
        texts, labels = batch
        optimizer.zero_grad()
        prediction = model(texts).squeeze()
        # break
        loss = loss_func(prediction, labels)
        loss.backward()
        train_loss += loss.item()  # accumulate a Python float so the autograd graph is not kept alive
        optimizer.step()
        # break
    # break
    train_loss /= len(train_iter)
    val_loss = 0.0
    fp, fn, tp = 0, 0, 0
    model.eval()
    pbar = tqdm(enumerate(valid_iter), total=len(valid_iter), leave=False)
    pbar.set_description(f"Epoch {epoch}")
    with torch.no_grad():
        for it, batch in pbar:
            texts, labels = batch
            prediction = model(texts).squeeze()
            val_loss += loss_func(prediction, labels)
            # Cast to int arrays and pass labels=[0, 1] so the matrix is always 2x2,
            # even when a batch contains only one class
            y_true = labels.cpu().numpy().astype(int)
            y_pred = (prediction.cpu().numpy() > THRSH).astype(int)
            _, fp_batch, fn_batch, tp_batch = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
            fp += fp_batch
            fn += fn_batch
            tp += tp_batch
    val_loss /= len(valid_iter)
    val_f1 = f1(precision(tp, fp), recall(tp, fn))
    if val_loss < min_loss:
        min_loss = val_loss
        best_loss_model = deepcopy(model.state_dict())
    if val_f1 > max_f1:
        max_f1 = val_f1
        best_f1_model = deepcopy(model.state_dict())
    else:
        cur_patience += 1
        if cur_patience == patience:
            cur_patience = 0
            break
    print(f'Epoch: {epoch}, Training Loss: {train_loss}, Validation Loss: {val_loss}, Validation F1: {val_f1}')
    # break

Epoch: 1, Training Loss: 0.6353393793106079, Validation Loss: 0.461436003446579, Validation F1: 0.8002605863192183
Epoch: 2, Training Loss: 0.47869154810905457, Validation Loss: 0.40366503596305847, Validation F1: 0.830327760958273
Epoch: 3, Training Loss: 0.4071365296840668, Validation Loss: 0.36657556891441345, Validation F1: 0.853454821564161
Epoch: 4, Training Loss: 0.35030749440193176, Validation Loss: 0.3464575707912445, Validation F1: 0.8544152744630072
Epoch: 5, Training Loss: 0.29146167635917664, Validation Loss: 0.33163371682167053, Validation F1: 0.856297889501277
Epoch: 6, Training Loss: 0.2352072149515152, Validation Loss: 0.3163353204727173, Validation F1: 0.8655439721999466
Epoch: 7, Training Loss: 0.1823510378599167, Validation Loss: 0.3123776316642761, Validation F1: 0.8699003322259138
Epoch: 8, Training Loss: 0.14280658960342407, Validation Loss: 0.3191465139389038, Validation F1: 0.8763017526035052
Epoch: 9, Training Loss: 0.10532651841640472, Validation Loss: 0.3221154808998108, Validation F1: 0.8749196554827099
Epoch: 10, Training Loss: 0.07461827993392944, Validation Loss: 0.34008124470710754, Validation F1: 0.8752598752598753

In [7]:
min_loss, max_f1

(tensor(0.3124, device='cuda:0'), 0.8763017526035052)

In [8]:
model.load_state_dict(best_f1_model)
# model.load_state_dict(best_loss_model)

<All keys matched successfully>

In [43]:
test_loss = 0
fp, fn, tp = 0, 0, 0

pbar = tqdm(enumerate(test_iter), total=len(test_iter), leave=False)
model.eval()
with torch.no_grad():
    for it, batch in pbar:
        texts, labels = batch
        prediction = model(texts).squeeze()
        test_loss += loss_func(prediction, labels)
        # Cast to int arrays and pass labels=[0, 1] so the matrix is always 2x2,
        # even when a batch contains only one class
        y_true = labels.cpu().numpy().astype(int)
        y_pred = (prediction.cpu().numpy() > THRSH).astype(int)
        _, fp_batch, fn_batch, tp_batch = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
        fp += fp_batch
        fn += fn_batch
        tp += tp_batch
test_loss /= len(test_iter)
test_f1 = f1(precision(tp, fp), recall(tp, fn))
print(f'Test Loss: {test_loss}, Test F1: {test_f1}')

Test Loss: 0.3257286548614502, Test F1: 0.8705636743215032


Compute the F1 score of your classifier.

**Answer**:

Model|Training Loss|Best Validation Loss|Best Validation F1|Test Loss (best F1)|Test Loss (best loss)|Test F1 (best F1)|Test F1 (best loss)|Notes
-|-|-|-|-|-|-|-|-
CNN|0.62-0.07|0.31|0.87|0.32|0.31|0.87|0.87|Better result than the RNN; kernel sizes: 2, 3, 4

## Interpretability

Let's see what our model looks at. Running the code below is enough.

In [20]:
!pip install -q captum

In [9]:
from captum.attr import LayerIntegratedGradients, TokenReferenceBase, visualization

PAD_IND = TEXT.vocab.stoi['pad']

token_reference = TokenReferenceBase(reference_token_idx=PAD_IND)
lig = LayerIntegratedGradients(model, model.embedding)

In [10]:
def forward_with_softmax(inp):
    logits = model(inp)
    return torch.softmax(logits, 0)[0][1]

def forward_with_sigmoid(input):
    return torch.sigmoid(model(input))


# accumulate a couple of samples in this list for visualization purposes
vis_data_records_ig = []

def interpret_sentence(model, sentence, min_len = 7, label = 0):
    model.eval()
    text = [tok for tok in TEXT.tokenize(sentence)]
    if len(text) < min_len:
        text += ['pad'] * (min_len - len(text))
    indexed = [TEXT.vocab.stoi[t] for t in text]

    model.zero_grad()

    input_indices = torch.tensor(indexed, device=device)
    input_indices = input_indices.unsqueeze(0)
    
    # input_indices dim: [1, sequence_length]
    seq_length = len(indexed)  # the reference must match the real input length, which can exceed min_len

    # predict
    # pred = forward_with_sigmoid(input_indices).item()
    pred = model(input_indices).item()  # the sigmoid is already applied inside the model
    pred_ind = round(pred)

    # generate reference indices for each sample
    reference_indices = token_reference.generate_reference(seq_length, device=device).unsqueeze(0)

    # compute attributions and approximation delta using layer integrated gradients
    attributions_ig, delta = lig.attribute(input_indices, reference_indices, \
                                           n_steps=5000, return_convergence_delta=True)

    print('pred: ', LABEL.vocab.itos[pred_ind], '(', '%.2f'%pred, ')', ', delta: ', abs(delta))

    add_attributions_to_visualizer(attributions_ig, text, pred, pred_ind, label, delta, vis_data_records_ig)
    
def add_attributions_to_visualizer(attributions, text, pred, pred_ind, label, delta, vis_data_records):
    attributions = attributions.sum(dim=2).squeeze(0)
    attributions = attributions / torch.norm(attributions)
    attributions = attributions.cpu().detach().numpy()

    # storing couple samples in an array for visualization purposes
    vis_data_records.append(visualization.VisualizationDataRecord(
                            attributions,
                            pred,
                            LABEL.vocab.itos[pred_ind],
                            LABEL.vocab.itos[label],
                            LABEL.vocab.itos[1],
                            attributions.sum(),       
                            text,
                            delta))

In [11]:
interpret_sentence(model, 'It was a fantastic performance !', label=1)
interpret_sentence(model, 'Best film ever', label=1)
interpret_sentence(model, 'Such a great show!', label=1)
interpret_sentence(model, 'It was a horrible movie', label=0)
interpret_sentence(model, 'I\'ve never watched something as bad', label=0)
interpret_sentence(model, 'It is a disgusting movie!', label=0)

pred:  pos ( 0.60 ) , delta:  tensor([7.4809e-07], device='cuda:0', dtype=torch.float64)
pred:  neg ( 0.00 ) , delta:  tensor([3.3818e-09], device='cuda:0', dtype=torch.float64)
pred:  neg ( 0.13 ) , delta:  tensor([5.5045e-07], device='cuda:0', dtype=torch.float64)
pred:  neg ( 0.00 ) , delta:  tensor([2.8027e-09], device='cuda:0', dtype=torch.float64)
pred:  neg ( 0.03 ) , delta:  tensor([2.8645e-07], device='cuda:0', dtype=torch.float64)
pred:  neg ( 0.00 ) , delta:  tensor([2.6333e-08], device='cuda:0', dtype=torch.float64)
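
The `delta` values printed above are small because integrated gradients satisfy a completeness property: the attributions should sum to the difference between the model output on the input and on the baseline, and `delta` reports the numerical error of that sum. Below is a minimal NumPy sketch of the same idea on a toy logistic model; the weights, input, and all-zero baseline are hypothetical, and Captum's own integration scheme may differ from the midpoint rule used here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def integrated_gradients(x, baseline, w, b, n_steps=1000):
    """Integrated gradients for f(x) = sigmoid(w . x + b), approximated with a midpoint Riemann sum."""
    alphas = (np.arange(n_steps) + 0.5) / n_steps          # midpoints in (0, 1)
    points = baseline + alphas[:, None] * (x - baseline)   # straight path from baseline to x
    p = sigmoid(points @ w + b)
    grads = (p * (1.0 - p))[:, None] * w                   # df/dx at each point on the path
    return (x - baseline) * grads.mean(axis=0)

w = np.array([2.0, -1.0, 0.5])  # hypothetical model weights
b = -0.25
x = np.array([1.0, 0.5, -2.0])
baseline = np.zeros_like(x)

attr = integrated_gradients(x, baseline, w, b)
# Completeness check: attributions should sum to f(x) - f(baseline)
delta = attr.sum() - (sigmoid(x @ w + b) - sigmoid(baseline @ w + b))
```

The sign of each attribution shows whether that feature pushed the output up or down relative to the baseline, which is what the colored words in the visualization encode.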


Try adding your own examples!

In [None]:
print('Visualize attributions based on Integrated Gradients')
visualization.visualize_text(vis_data_records_ig)

Visualize attributions based on Integrated Gradients


True Label|Predicted Label|Attribution Label|Attribution Score|Word Importance
-|-|-|-|-
pos|pos (0.95)|pos|0.82|It was a fantastic performance ! pad
pos|pos (0.58)|pos|0.46|Best film ever pad pad pad pad
pos|pos (1.00)|pos|1.22|Such a great show! pad pad pad
neg|neg (0.02)|pos|-0.87|It was a horrible movie pad pad
neg|neg (0.02)|pos|-0.7|I've never watched something as bad pad


## Word Embeddings

You haven't forgotten how we can apply what we know about word2vec and GloVe, have you? Let's try it!

In [13]:
TEXT.build_vocab(trn, vectors=GloVe())
LABEL.build_vocab(trn)

word_embeddings = TEXT.vocab.vectors

In [14]:
train, tst = datasets.IMDB.splits(TEXT, LABEL)
trn, vld = train.split(random_state=random.seed(SEED))

device = "cuda" if torch.cuda.is_available() else "cpu"
train_iter, valid_iter, test_iter = BucketIterator.splits(
        (trn, vld, tst),
        batch_sizes=(128, 256, 256),
        sort=False,
        sort_key=lambda x: len(x.text),  # IMDB examples expose .text, not .src
        sort_within_batch=False,
        device=device,
        repeat=False,
)

In [15]:
model = CNN(
    vocab_size=len(TEXT.vocab), 
    emb_dim=300, 
    out_channels=128,
    kernel_sizes=[2, 3, 4], 
    dropout=0.5)

prev_shape = model.embedding.weight.shape

model.embedding.weight.data.copy_(word_embeddings)

assert prev_shape == model.embedding.weight.shape
model.to(device)

optimizer = torch.optim.Adam(model.parameters())
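
Conceptually, `TEXT.vocab.vectors` is a matrix with one row per vocabulary entry, filled from the pretrained table by token lookup. The NumPy sketch below mimics that construction on a toy vocabulary; the vectors are made up, and giving out-of-vocabulary tokens a zero row is an assumption about the default behavior rather than something this notebook verifies.

```python
import numpy as np

dim = 4
# Hypothetical pretrained vectors; the GloVe vectors used in this notebook are 300-dimensional
pretrained = {
    "good": np.array([0.1, 0.2, 0.3, 0.4]),
    "movie": np.array([0.5, 0.1, 0.0, 0.2]),
}
itos = ["<unk>", "<pad>", "good", "movie", "zzzz"]  # toy index-to-token list

weights = np.zeros((len(itos), dim))
for idx, token in enumerate(itos):
    if token in pretrained:
        weights[idx] = pretrained[token]  # row idx holds the vector of token idx
```

Copying such a matrix into the embedding layer only works if its shape matches `(vocab_size, emb_dim)`, which is what the `assert` on `prev_shape` in the cell above guards.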

You know what to do.

In [16]:
import numpy as np
from copy import deepcopy

min_loss = np.inf
max_f1 = 0
cur_patience = 0
THRSH = 0.5

for epoch in range(1, max_epochs + 1):
    train_loss = 0.0
    model.train()
    pbar = tqdm(enumerate(train_iter), total=len(train_iter), leave=False)
    pbar.set_description(f"Epoch {epoch}")
    for it, batch in pbar:
        texts, labels = batch
        optimizer.zero_grad()
        prediction = model(texts).squeeze()
        # break
        loss = loss_func(prediction, labels)
        loss.backward()
        train_loss += loss.item()  # accumulate a Python float so the autograd graph is not kept alive
        optimizer.step()
        # break
    # break
    train_loss /= len(train_iter)
    val_loss = 0.0
    fp, fn, tp = 0, 0, 0
    model.eval()
    pbar = tqdm(enumerate(valid_iter), total=len(valid_iter), leave=False)
    pbar.set_description(f"Epoch {epoch}")
    with torch.no_grad():
        for it, batch in pbar:
            texts, labels = batch
            prediction = model(texts).squeeze()
            val_loss += loss_func(prediction, labels)
            # Cast to int arrays and pass labels=[0, 1] so the matrix is always 2x2,
            # even when a batch contains only one class
            y_true = labels.cpu().numpy().astype(int)
            y_pred = (prediction.cpu().numpy() > THRSH).astype(int)
            _, fp_batch, fn_batch, tp_batch = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
            fp += fp_batch
            fn += fn_batch
            tp += tp_batch
    val_loss /= len(valid_iter)
    val_f1 = f1(precision(tp, fp), recall(tp, fn))
    if val_loss < min_loss:
        min_loss = val_loss
        best_loss_model = deepcopy(model.state_dict())
    if val_f1 > max_f1:
        max_f1 = val_f1
        best_f1_model = deepcopy(model.state_dict())
    else:
        cur_patience += 1
        if cur_patience == patience:
            cur_patience = 0
            break
    print(f'Epoch: {epoch}, Training Loss: {train_loss}, Validation Loss: {val_loss}, Validation F1: {val_f1}')
    # break

Epoch: 1, Training Loss: 0.475992351770401, Validation Loss: 0.3353918790817261, Validation F1: 0.8595528183157048

Epoch: 2, Training Loss: 0.2830614745616913, Validation Loss: 0.2900640070438385, Validation F1: 0.8841804237618615

Epoch: 3, Training Loss: 0.1662633717060089, Validation Loss: 0.28855374455451965, Validation F1: 0.8834388185654007

Epoch: 4, Training Loss: 0.069611556828022, Validation Loss: 0.3138796091079712, Validation F1: 0.8817373103087388

In [28]:
min_loss, max_f1

(tensor(0.2893, device='cuda:0'), 0.8852459016393442)
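
The stopping rule in the loop above can be isolated into a few lines. The sketch below shows one common patience scheme, in which the counter resets whenever the validation score improves; the function name `early_stopping_epoch` and the score list are hypothetical and only illustrate the bookkeeping, they are not a log of the run above.

```python
def early_stopping_epoch(val_scores, patience):
    """Return (stop_epoch, best_epoch, best_score) for per-epoch validation scores."""
    best_score = float("-inf")
    best_epoch = 0
    stale_epochs = 0  # epochs since the last improvement
    for epoch, score in enumerate(val_scores, start=1):
        if score > best_score:
            best_score, best_epoch = score, epoch
            stale_epochs = 0  # improvement resets the counter
        else:
            stale_epochs += 1
            if stale_epochs == patience:
                return epoch, best_epoch, best_score
    # Ran out of epochs before patience was exhausted.
    return len(val_scores), best_epoch, best_score


# Illustrative scores only (hypothetical values).
scores = [0.86, 0.88, 0.87, 0.87, 0.86, 0.89]
stop_epoch, best_epoch, best_score = early_stopping_epoch(scores, patience=3)
print(stop_epoch, best_epoch, best_score)
```

With `patience=3` training stops after epoch 5 and never sees the better score at epoch 6, which is the usual trade-off: a larger patience costs more epochs but is less likely to stop on a temporary plateau.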

In [21]:
model.load_state_dict(best_f1_model)
# model.load_state_dict(best_loss_model)

<All keys matched successfully>

In [32]:
test_loss = 0
fp, fn, tp = 0, 0, 0

pbar = tqdm(enumerate(test_iter), total=len(test_iter), leave=False)
model.eval()
with torch.no_grad():
    for it, batch in pbar:
        texts, labels = batch
        prediction = model(texts).squeeze()
        test_loss += loss_func(prediction, labels)
        # labels=[0, 1] keeps the matrix 2x2 even when a batch contains a single class
        conf_matrix = confusion_matrix(labels.cpu(), prediction.cpu() > THRSH, labels=[0, 1])
        _, fp_batch, fn_batch, tp_batch = conf_matrix.ravel()
        fp += fp_batch
        fn += fn_batch
        tp += tp_batch
test_loss /= len(test_iter)
test_f1 = f1(precision(tp, fp), recall(tp, fn))
print(f'Test Loss: {test_loss}, Test F1: {test_f1}')

Test Loss: 0.2850239872932434, Test F1: 0.884931082546074


Compute the F1 score of your classifier.

**Answer**:

Model|Training Loss|Best Validation Loss|Best Validation F1|Test Loss (best f1)|Test Loss (best loss)|Test F1 (best f1)|Test F1 (best loss)|Notes
-|-|-|-|-|-|-|-|-
CNN(pretrained embeddings)|0.47-0.02|0.28|0.88|0.28|0.28|0.884|0.884|The result improved
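
The F1 values in the table come from true-positive, false-positive and false-negative counts summed over all batches, not from per-batch F1 scores averaged together. Here is a minimal sketch of that aggregation in plain Python; the helper names `batch_counts` and `f1_from_counts` and the toy batches are hypothetical, and the notebook's own `precision`, `recall` and `f1` helpers may be defined differently.

```python
def batch_counts(probs, labels, threshold=0.5):
    """Count true positives, false positives and false negatives in one batch."""
    tp = fp = fn = 0
    for p, y in zip(probs, labels):
        pred = int(p > threshold)
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 1:
            fn += 1
    return tp, fp, fn


def f1_from_counts(tp, fp, fn):
    """F1 as the harmonic mean of precision and recall, guarding against empty denominators."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy batches of (predicted probabilities, true labels); hypothetical data.
batches = [
    ([0.9, 0.2, 0.7, 0.4], [1, 0, 0, 1]),
    ([0.1, 0.3], [0, 0]),          # all-negative batch: only true negatives
    ([0.8, 0.6, 0.3], [1, 1, 1]),
]

tp = fp = fn = 0
for probs, labels in batches:
    tp_b, fp_b, fn_b = batch_counts(probs, labels)
    tp, fp, fn = tp + tp_b, fp + fp_b, fn + fn_b

score = f1_from_counts(tp, fp, fn)
print(tp, fp, fn, round(score, 4))
```

Note that the all-negative batch contributes nothing to `tp`, `fp` or `fn`: correctly rejected negatives are true negatives, which F1 ignores.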

Let's check how well it all works!

In [23]:
PAD_IND = TEXT.vocab.stoi['pad']

token_reference = TokenReferenceBase(reference_token_idx=PAD_IND)
lig = LayerIntegratedGradients(model, model.embedding)
vis_data_records_ig = []

interpret_sentence(model, 'It was a fantastic performance !', label=1)
interpret_sentence(model, 'Best film ever', label=1)
interpret_sentence(model, 'Such a great show!', label=1)
interpret_sentence(model, 'It was a horrible movie', label=0)
interpret_sentence(model, 'I\'ve never watched something as bad', label=0)
interpret_sentence(model, 'It is a disgusting movie!', label=0)

pred:  pos ( 0.99 ) , delta:  tensor([2.7446e-05], device='cuda:0', dtype=torch.float64)
pred:  neg ( 0.11 ) , delta:  tensor([2.3294e-08], device='cuda:0', dtype=torch.float64)
pred:  pos ( 0.66 ) , delta:  tensor([8.7727e-06], device='cuda:0', dtype=torch.float64)
pred:  neg ( 0.00 ) , delta:  tensor([1.7405e-06], device='cuda:0', dtype=torch.float64)
pred:  neg ( 0.05 ) , delta:  tensor([1.2509e-06], device='cuda:0', dtype=torch.float64)
pred:  neg ( 0.02 ) , delta:  tensor([1.8553e-06], device='cuda:0', dtype=torch.float64)
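
To make the `delta` values above easier to read, here is a small stand-alone sketch of Integrated Gradients on a toy logistic model. It assumes that `delta` is the convergence delta, that is, the sum of the attributions minus the difference between the model output at the input and at the baseline. The toy model, its weights and the midpoint-rule integration are illustrative assumptions and do not reproduce Captum's implementation.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def model_output(x, weights, bias):
    """Toy model: logistic regression on a feature vector."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def model_gradient(x, weights, bias):
    """Analytic gradient of the toy model with respect to its input."""
    p = model_output(x, weights, bias)
    return [p * (1.0 - p) * w for w in weights]


def integrated_gradients(x, baseline, weights, bias, n_steps=500):
    """Approximate the path integral from baseline to x with the midpoint rule."""
    attributions = [0.0] * len(x)
    for k in range(n_steps):
        alpha = (k + 0.5) / n_steps
        point = [b0 + alpha * (xi - b0) for xi, b0 in zip(x, baseline)]
        grad = model_gradient(point, weights, bias)
        for i in range(len(x)):
            attributions[i] += grad[i] * (x[i] - baseline[i]) / n_steps
    return attributions


# Hypothetical weights and input, chosen only for illustration.
weights, bias = [2.0, -1.0, 0.5], 0.1
x = [1.0, 0.5, -1.0]
baseline = [0.0, 0.0, 0.0]

attributions = integrated_gradients(x, baseline, weights, bias)
output_diff = model_output(x, weights, bias) - model_output(baseline, weights, bias)
delta = sum(attributions) - output_diff
print(attributions, delta)
```

A `delta` close to zero means the attributions account for almost all of the change in the model output between the baseline and the input, so small values like the ones printed above suggest the integral approximation is accurate.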


In [24]:
print('Visualize attributions based on Integrated Gradients')
visualization.visualize_text(vis_data_records_ig)

Visualize attributions based on Integrated Gradients


True Label|Predicted Label|Attribution Label|Attribution Score|Word Importance
-|-|-|-|-
pos|pos (0.99)|pos|1.55|It was a fantastic performance ! pad
pos|neg (0.11)|pos|1.22|Best film ever pad pad pad pad
pos|pos (0.66)|pos|1.29|Such a great show! pad pad pad
neg|neg (0.00)|pos|-0.91|It was a horrible movie pad pad
neg|neg (0.05)|pos|-0.23|I've never watched something as bad pad


