# MVD 10. cvičení
V dnešním cvičení rozšíříme BERT model o jednu FCNN vrstvu, která se dotrénuje na úloze klasifikace sentimentu. Bude se vycházet z předtrénovaného BERT modelu a přidaná vrstva se bude trénovat s Pytorch knihovnou.

In [1]:
import random
import torch
import numpy as np
import pandas as pd

try:
    from distutils.dir_util import copy_tree

    copy_tree('../input/imdb-dataset-sentiment-analysis-in-csv-format/', '../working/data/')
except:
    pass

In [2]:
# Parametry trénování
epochs = 20
lr = 1e-6
batch_size = 16

## 1. část - Příprava dat pro detekci sentimentu

Stáhněte si data filmových recenzí na [Kaggle datasets](https://www.kaggle.com/columbine/imdb-dataset-sentiment-analysis-in-csv-format) a rozbalte je do adresáře data. Spusťte a projděte následující kód.

In [3]:
# Vybrána menší množina dat -> lze spustit i na cpu
train_df = pd.read_csv('data/Train.csv').head(1000)
train_df['type'] = 'train'
test_df = pd.read_csv('data/Test.csv').head(100)
test_df['type'] = 'test'

data_df = pd.concat([train_df, test_df])

print(data_df.shape)
print(data_df.head())
print(data_df.text.iloc[0])
# Kontrola vyvážení datasetu
print(data_df.label.value_counts())

(1100, 3)
                                                text  label   type
0  I grew up (b. 1965) watching and loving the Th...      0  train
1  When I put this movie in my DVD player, and sa...      0  train
2  Why do people who do not know what a particula...      0  train
3  Even though I have great interest in Biblical ...      0  train
4  Im a die hard Dads Army fan and nothing will e...      1  train
I grew up (b. 1965) watching and loving the Thunderbirds. All my mates at school watched. We played "Thunderbirds" before school, during lunch and after school. We all wanted to be Virgil or Scott. No one wanted to be Alan. Counting down from 5 became an art form. I took my children to see the movie hoping they would get a glimpse of what I loved as a child. How bitterly disappointing. The only high point was the snappy theme tune. Not that it could compare with the original score of the Thunderbirds. Thankfully early Saturday mornings one television channel still plays reruns of t

In [4]:
# Nastavení random seedu pro možnost reprodukce experimentu
seed = 42
random.seed(seed)
torch.manual_seed(seed)
np.random.seed(seed)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

In [5]:
from transformers import BertTokenizer
from torch.utils.data import TensorDataset

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case = True)
encoder_train = tokenizer.batch_encode_plus(data_df[data_df['type']=='train'].text.values,
                                           add_special_tokens = True,
                                           padding = True,
                                           truncation=True,
                                           return_tensors = 'pt')

encoder_test = tokenizer.batch_encode_plus(data_df[data_df['type']=='test'].text.values,
                                           add_special_tokens = True,
                                           padding = True,
                                           truncation=True,
                                           return_tensors = 'pt')

input_ids_train = encoder_train['input_ids']
attention_masks_train = encoder_train["attention_mask"]
labels_train = torch.tensor(data_df[data_df['type'] == 'train'].label.values)

input_ids_test = encoder_test['input_ids']
attention_masks_test = encoder_test["attention_mask"]
labels_test = torch.tensor(data_df[data_df['type'] == 'test'].label.values)

Downloading:   0%|          | 0.00/226k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/455k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/570 [00:00<?, ?B/s]

### Attention masks
Attention masky používáme z důvodu paddingu (doplnění nul do určité délky). Attention maska obsahuje 0 nebo 1 a říká modelu, kde se nachází původní vstup a kde padding, na který se nemá zaměřovat. 


In [6]:
from torch.utils.data import RandomSampler, SequentialSampler, DataLoader

data_train = TensorDataset(input_ids_train, attention_masks_train, labels_train)
data_test = TensorDataset(input_ids_test, attention_masks_test, labels_test)

dataloader_train = DataLoader(data_train, shuffle = True, batch_size = batch_size)
dataloader_test = DataLoader(data_test, batch_size = batch_size)

## 2. část - Příprava modelu a trénování

In [7]:
from torch.nn import Module, Linear
from transformers import BertModel, BertForSequenceClassification, AdamW, get_linear_schedule_with_warmup


class ExtendedBert(Module):
    def __init__(self):
        super(ExtendedBert, self).__init__()
        
        self.bert = BertForSequenceClassification.from_pretrained('bert-base-uncased')
        self.bert.classifier = Linear(in_features=768, out_features=2, bias=True)
    
    
    def forward(self, **kwargs):
          return self.bert(**kwargs)

In [8]:
# model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels = 2)
model = ExtendedBert()
optimizer = AdamW(model.parameters(), lr = lr)
scheduler = get_linear_schedule_with_warmup(
                optimizer,
                num_warmup_steps = 0,
                num_training_steps = len(dataloader_train)*epochs 
            )

Downloading:   0%|          | 0.00/420M [00:00<?, ?B/s]

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

### BertForSequenceClassification
Tato třída již obsahuje jednu přidanou lineární vrstvu se 2 výstupními třídami (viz parametr num_labels). Tuto vrstvu uvidíte na konci seznamu vrstev.

In [9]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print("Device: {}".format(device))
model.to(device)

Device: cuda


ExtendedBert(
  (bert): BertForSequenceClassification(
    (bert): BertModel(
      (embeddings): BertEmbeddings(
        (word_embeddings): Embedding(30522, 768, padding_idx=0)
        (position_embeddings): Embedding(512, 768)
        (token_type_embeddings): Embedding(2, 768)
        (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        (dropout): Dropout(p=0.1, inplace=False)
      )
      (encoder): BertEncoder(
        (layer): ModuleList(
          (0): BertLayer(
            (attention): BertAttention(
              (self): BertSelfAttention(
                (query): Linear(in_features=768, out_features=768, bias=True)
                (key): Linear(in_features=768, out_features=768, bias=True)
                (value): Linear(in_features=768, out_features=768, bias=True)
                (dropout): Dropout(p=0.1, inplace=False)
              )
              (output): BertSelfOutput(
                (dense): Linear(in_features=768, out_features=768, bias=True)

In [10]:
from tqdm.notebook import tqdm

for epoch in tqdm(range(epochs)):
    model.train()
    train_loss = 0
    
    with tqdm(dataloader_train, desc = f'Epoch: {epoch}') as pbar:
        for batch in pbar:
            model.zero_grad()

            inputs = {
                "input_ids" : batch[0].to(device),
                "attention_mask" : batch[1].to(device),
                "labels" : batch[2].to(device)
            }
            outputs = model(**inputs)

            loss = outputs[0]
            # outputs[1] -> logits
            train_loss += loss.item()
            loss.backward()

            optimizer.step()
            scheduler.step()

            pbar.set_postfix({'training_loss':f'{(loss.item() / len(batch)):.3f}'})
    
    loss_train_avg = train_loss / len(dataloader_train)
    tqdm.write(f'Training Loss: {loss_train_avg}')

  0%|          | 0/20 [00:00<?, ?it/s]

Epoch: 0:   0%|          | 0/63 [00:00<?, ?it/s]

Training Loss: 0.6804857736542111


Epoch: 1:   0%|          | 0/63 [00:00<?, ?it/s]

Training Loss: 0.6360361140871805


Epoch: 2:   0%|          | 0/63 [00:00<?, ?it/s]

Training Loss: 0.5870602471487862


Epoch: 3:   0%|          | 0/63 [00:00<?, ?it/s]

Training Loss: 0.5413768925364055


Epoch: 4:   0%|          | 0/63 [00:00<?, ?it/s]

Training Loss: 0.4865164714200156


Epoch: 5:   0%|          | 0/63 [00:00<?, ?it/s]

Training Loss: 0.4285193993931725


Epoch: 6:   0%|          | 0/63 [00:00<?, ?it/s]

Training Loss: 0.3933138951422676


Epoch: 7:   0%|          | 0/63 [00:00<?, ?it/s]

Training Loss: 0.3589331599928084


Epoch: 8:   0%|          | 0/63 [00:00<?, ?it/s]

Training Loss: 0.33000900750122375


Epoch: 9:   0%|          | 0/63 [00:00<?, ?it/s]

Training Loss: 0.2965634020548018


Epoch: 10:   0%|          | 0/63 [00:00<?, ?it/s]

Training Loss: 0.28943150076601243


Epoch: 11:   0%|          | 0/63 [00:00<?, ?it/s]

Training Loss: 0.2765860555190889


Epoch: 12:   0%|          | 0/63 [00:00<?, ?it/s]

Training Loss: 0.26282867245257846


Epoch: 13:   0%|          | 0/63 [00:00<?, ?it/s]

Training Loss: 0.2503333179250596


Epoch: 14:   0%|          | 0/63 [00:00<?, ?it/s]

Training Loss: 0.25152967704666984


Epoch: 15:   0%|          | 0/63 [00:00<?, ?it/s]

Training Loss: 0.23515189510016216


Epoch: 16:   0%|          | 0/63 [00:00<?, ?it/s]

Training Loss: 0.23181545214047508


Epoch: 17:   0%|          | 0/63 [00:00<?, ?it/s]

Training Loss: 0.2332937940955162


Epoch: 18:   0%|          | 0/63 [00:00<?, ?it/s]

Training Loss: 0.23354221513820073


Epoch: 19:   0%|          | 0/63 [00:00<?, ?it/s]

Training Loss: 0.22259533310693408


## 3. Část - Vyhodnocení
Vytvořte funkci `eval`, která bude přijímat jako vstupní parametr test (nebo valid) DataLoader. V této funkci projdete data v evaluačním módu modelu, spočítáte loss na daných datech a využijete výstup `output[1] # logits` pro výpočet úspěšnosti (accuracy) a f1-skóre (pomocí sklearn funkce).

In [11]:
from sklearn.metrics import f1_score

In [12]:
def evaluate(model, dataloader):
    model.eval()
    eval_loss = 0.
    eval_len = 0
    eval_steps = 0
    eval_f1 = 0.
    
    with tqdm(dataloader, desc = f'Evaluating') as pbar:
        with torch.no_grad():
            for batch in pbar:
                inputs = {
                    "input_ids" : batch[0].to(device),
                    "attention_mask" : batch[1].to(device),
                    "labels" : batch[2].to(device)
                }
                
                outputs = model(**inputs)
                loss = outputs[0]
                logits = torch.argmax(outputs[1].cpu(), axis=1)
                
                eval_f1 += f1_score(batch[2], logits)
                eval_steps += 1
                
                eval_loss += loss.item()
                eval_len += len(batch)
    
    print(f'Avg loss: {eval_loss / eval_len}')
    print(f'F1: {eval_f1 / eval_steps}')

evaluate(model, dataloader_test)

Evaluating:   0%|          | 0/7 [00:00<?, ?it/s]

Avg loss: 0.09837767978509267
F1: 0.8426852559205501


# BONUS: implementovat vlastní lineární vrstvu
`BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels = 2)` -> `num_labels = 2` přidává lineární vrstvu se dvěma výstupy, tak to akorát odebrat a přidat `nn.Linear` místo toho.