
# Sentiment Classification of IMDb Reviews with BERT

In [1]:
pip install pytorch-transformers

Collecting pytorch-transformers
  Downloading pytorch_transformers-1.2.0-py3-none-any.whl (176 kB)
Installing collected packages: pytorch-transformers
Successfully installed pytorch-transformers-1.2.0
Note: you may need to restart the kernel to use updated packages.


## Data Preparation

Import the required libraries:

In [2]:
import io
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from tqdm import tqdm, trange

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from keras.preprocessing.sequence import pad_sequences

import torch
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from pytorch_transformers import BertTokenizer, BertConfig
from pytorch_transformers import AdamW, BertForSequenceClassification

import warnings
warnings.filterwarnings("ignore")

In [3]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

device(type='cuda')

The dataset consists of 50,000 movie reviews from IMDb, half of which are positive.

In [4]:
dataset = pd.read_csv('/kaggle/input/imdb-dataset-of-50k-movie-reviews/IMDB Dataset.csv')
dataset.sample(10)

Unnamed: 0,review,sentiment
29977,I always hated this retarded show .I liked the...,negative
10023,I've also been looking to find this movie for ...,positive
11015,This adaption contains two parts: <br /><br />...,negative
25910,One of the flat-out drollest movies of all-tim...,positive
40227,Fabulous costumes by Edith Head who painted th...,positive
29330,"OK, people, honestly... this gotta be one of t...",negative
28341,Empty shortening of John Irving's novel strive...,negative
39790,Time has not been kind to this movie. Once con...,positive
13411,The theme is controversial and the depiction o...,positive
24068,If you are having trouble sleeping or just wan...,negative


Extract the review text and wrap each review in BERT's special tokens, `[CLS]` at the start and `[SEP]` at the end:

In [5]:
sentences = dataset['review'].values
sentences = ["[CLS] " + sentence + " [SEP]" for sentence in sentences]

Extract the labels and encode them as integers (1 for positive, 0 for negative):

In [6]:
# Each label is wrapped in a one-element list, so the label tensor has shape (N, 1)
labels = [[1] if x == 'positive' else [0] for x in dataset['sentiment'].values]

Split the data into training and test sets (70% / 30%):

In [7]:
train_sentences, test_sentences, train_gt, test_gt = train_test_split(sentences, labels, test_size=0.3)
len(train_gt), len(test_gt)

(35000, 15000)

Tokenize the text with `BertTokenizer`:

In [8]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
tokenized_texts = [tokenizer.tokenize(sent) for sent in train_sentences]
print(tokenized_texts[0])

100%|██████████| 231508/231508 [00:00<00:00, 1247364.88B/s]


['[CLS]', 'this', 'film', 'is', 'roughly', 'what', 'it', 'sounds', 'like', ':', 'a', 'futuristic', 'version', 'of', 'the', 'cinderella', 'legend', 'but', 'with', 'songs', 'and', '(', 'fairly', 'tame', ')', 'sex', 'scenes', '!', 'the', 'film', 'is', 'not', 'sure', 'what', 'it', 'wants', 'to', 'be', 'and', 'pretty', 'much', 'ends', 'up', 'a', 'mess', '.', 'it', "'", 's', 'more', 'expensive', 'looking', 'than', 'most', 'of', 'director', 'al', 'adams', '##on', "'", 's', 'films', 'but', 'it', "'", 's', 'not', 'at', 'the', 'same', 'budget', 'level', 'that', 'viewers', 'have', 'come', 'to', 'expect', 'from', 'sci', '-', 'fi', 'films', '.', 'the', 'actors', 'are', 'pretty', 'bad', 'and', 'unlike', 'most', 'adams', '##on', 'films', ',', 'there', 'are', 'no', 'former', 'big', 'name', '##rs', 'or', 'b', 'actors', '.', 'some', 'of', 'the', 'music', 'is', 'ok', 'but', 'it', "'", 's', 'easy', 'to', 'see', 'why', 'cinderella', '2000', 'has', 'been', 'forgotten', 'for', 'so', 'many', 'years', '.', '[S
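
The `##` prefix in the output marks a piece that continues the preceding token: `adamson` is not in the vocabulary, so it is split into `adams` and `##on`. The sketch below illustrates the greedy longest-match-first idea behind this kind of splitting. It is a simplified illustration rather than the `BertTokenizer` implementation; the function `wordpiece_split` and the tiny vocabulary `toy_vocab` are hypothetical.

```python
def wordpiece_split(word, vocab, unk_token="[UNK]"):
    """Greedily split one lowercase word into the longest matching vocabulary pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary; the real bert-base-uncased vocabulary has about 30,000 entries.
toy_vocab = {"film", "adams", "##on", "name", "##rs"}

print(wordpiece_split("film", toy_vocab))     # ['film']
print(wordpiece_split("adamson", toy_vocab))  # ['adams', '##on']
print(wordpiece_split("namers", toy_vocab))   # ['name', '##rs']
```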

Convert the tokens to vocabulary IDs, bring every sequence to the same length (150 tokens) by truncating and padding, and build the attention masks:

In [9]:
input_ids = [tokenizer.convert_tokens_to_ids(x) for x in tokenized_texts]
input_ids = pad_sequences(
    input_ids,
    maxlen=150,
    truncating="post",
    padding="post"
)

attention_masks = [[float(i>0) for i in seq] for seq in input_ids]
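
As a rough illustration of what the padding step does, the pure-Python sketch below pads or truncates ID lists to a fixed length and derives the mask from the non-zero positions. It also shows one consequence of `truncating="post"`: for a review longer than the limit, the trailing `[SEP]` is cut off. The helper names `pad_or_truncate`, `keep_sep` and `attention_mask` are hypothetical. The IDs 101 and 102 are the `[CLS]` and `[SEP]` IDs in the `bert-base-uncased` vocabulary; the IDs in between are arbitrary placeholders.

```python
def pad_or_truncate(ids, maxlen, pad_id=0):
    """Cut the sequence at maxlen or right-pad it with pad_id (post-truncation, post-padding)."""
    ids = ids[:maxlen]
    return ids + [pad_id] * (maxlen - len(ids))


def keep_sep(ids, maxlen, sep_id=102, pad_id=0):
    """Variant that puts [SEP] back in the last slot when truncation would remove it."""
    if len(ids) > maxlen:
        ids = ids[:maxlen - 1] + [sep_id]
    return pad_or_truncate(ids, maxlen, pad_id)


def attention_mask(ids, pad_id=0):
    """1.0 for real tokens, 0.0 for padding."""
    return [float(i != pad_id) for i in ids]


short_ids = [101, 7, 8, 102]              # [CLS] ... [SEP], shorter than the limit
long_ids = [101, 7, 8, 9, 10, 11, 102]    # longer than the limit used below

print(pad_or_truncate(short_ids, 6))                  # [101, 7, 8, 102, 0, 0]
print(attention_mask(pad_or_truncate(short_ids, 6)))  # [1.0, 1.0, 1.0, 1.0, 0.0, 0.0]
print(pad_or_truncate(long_ids, 6))                   # [101, 7, 8, 9, 10, 11]
print(keep_sep(long_ids, 6))                          # [101, 7, 8, 9, 10, 102]
```

Whether re-appending `[SEP]` changes the results for this task has not been measured here; the variant is shown only to make the truncation behaviour visible.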

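Split the training data into training and validation sets. Both calls use the same `random_state`, so the inputs, labels, and masks are split identically: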

In [10]:
train_inputs, validation_inputs, train_labels, validation_labels = train_test_split(
    input_ids, train_gt, 
    random_state=42,
    test_size=0.1
)

train_masks, validation_masks, _, _ = train_test_split(
    attention_masks,
    input_ids,
    random_state=42,
    test_size=0.1
)
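
The two calls above stay aligned only because they share `random_state=42` and receive arrays of equal length. A less fragile pattern is to draw the shuffled indices once and apply them to every array (`train_test_split` also accepts more than two arrays in a single call, which has the same effect). The sketch below shows the idea in plain Python; `split_indices` is a hypothetical helper and does not reproduce the permutation that `train_test_split` would generate.

```python
import math
import random


def split_indices(n_samples, test_size, seed):
    """Return (train_idx, valid_idx) from a single seeded shuffle."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_valid = math.ceil(n_samples * test_size)
    return indices[n_valid:], indices[:n_valid]


# Toy stand-ins for input_ids, attention_masks and the labels.
toy_inputs = [[101, i, 102] for i in range(10)]
toy_masks = [[1.0, 1.0, 1.0] for _ in range(10)]
toy_labels = [[i % 2] for i in range(10)]

train_idx, valid_idx = split_indices(len(toy_inputs), test_size=0.1, seed=42)

# The same indices are applied to every array, so rows cannot drift apart.
toy_train_inputs = [toy_inputs[i] for i in train_idx]
toy_train_masks = [toy_masks[i] for i in train_idx]
toy_train_labels = [toy_labels[i] for i in train_idx]
toy_valid_inputs = [toy_inputs[i] for i in valid_idx]
toy_valid_labels = [toy_labels[i] for i in valid_idx]

print(len(toy_train_inputs), len(toy_valid_inputs))  # 9 1
```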

In [11]:
train_inputs = torch.tensor(train_inputs)
train_labels = torch.tensor(train_labels)
train_masks = torch.tensor(train_masks)

validation_inputs = torch.tensor(validation_inputs)
validation_labels = torch.tensor(validation_labels)
validation_masks = torch.tensor(validation_masks)

Use `DataLoader` to process the data in batches:

In [12]:
train_data = TensorDataset(train_inputs, train_masks, train_labels)
train_dataloader = DataLoader(
    train_data,
    sampler=RandomSampler(train_data),
    batch_size=32
)

In [13]:
validation_data = TensorDataset(validation_inputs, validation_masks, validation_labels)
validation_dataloader = DataLoader(
    validation_data,
    sampler=SequentialSampler(validation_data),
    batch_size=32
)
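
With a batch size of 32, the number of batches per pass follows directly from the split sizes. A quick arithmetic check, assuming the 35,000 training reviews from the earlier split and a 10% validation share:

```python
import math

n_train_total = 35000              # training portion after the 70/30 split
n_valid = n_train_total // 10      # 10% validation share
n_train = n_train_total - n_valid
batch_size = 32

# The last batch is smaller than 32, so the count is rounded up.
train_batches = math.ceil(n_train / batch_size)
valid_batches = math.ceil(n_valid / batch_size)

print(n_train, n_valid)              # 31500 3500
print(train_batches, valid_batches)  # 985 110
```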

## Training and Validating the Model

Load the pretrained `BertForSequenceClassification`:

In [14]:
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
# Use the device selected earlier so the notebook also runs without a GPU
model.to(device);

100%|██████████| 433/433 [00:00<00:00, 257315.62B/s]
100%|██████████| 440473133/440473133 [00:11<00:00, 37686981.88B/s]


We use AdamW as the optimizer with the following parameters:

In [15]:
param_optimizer = list(model.named_parameters())
# Biases and LayerNorm weights are conventionally excluded from weight decay
no_decay = ['bias', 'LayerNorm.weight']
optimizer_grouped_parameters = [
    {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)],
     'weight_decay': 0.01},
    {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)],
     'weight_decay': 0.0}
]

# AdamW reads the 'weight_decay' key of each group; a key named 'weight_decay_rate' is never read
optimizer = AdamW(optimizer_grouped_parameters, lr=2e-5)
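
The grouping above selects parameters by substring match on their names. The sketch below runs the same test on a few name strings to show which pattern set matches what. The names follow the `LayerNorm.weight` / `LayerNorm.bias` convention used by PyTorch BERT implementations, but the list is a hand-written sample, not output from `model.named_parameters()`, and `split_by_decay` is a hypothetical helper.

```python
# Hand-written sample of parameter names (an assumption for illustration).
sample_names = [
    "bert.embeddings.word_embeddings.weight",
    "bert.embeddings.LayerNorm.weight",
    "bert.embeddings.LayerNorm.bias",
    "bert.encoder.layer.0.attention.self.query.weight",
    "bert.encoder.layer.0.attention.self.query.bias",
    "classifier.weight",
    "classifier.bias",
]


def split_by_decay(names, no_decay):
    """Return (decayed, not_decayed) using the same substring test as the optimizer setup."""
    decayed = [n for n in names if not any(nd in n for nd in no_decay)]
    not_decayed = [n for n in names if any(nd in n for nd in no_decay)]
    return decayed, not_decayed


# Patterns written for the older gamma/beta naming never match LayerNorm.weight.
_, skipped_old = split_by_decay(sample_names, ["bias", "gamma", "beta"])
_, skipped_new = split_by_decay(sample_names, ["bias", "LayerNorm.weight"])

print("bert.embeddings.LayerNorm.weight" in skipped_old)  # False
print("bert.embeddings.LayerNorm.weight" in skipped_new)  # True
```

Printing the real names with `[n for n, _ in model.named_parameters()]` is a quick way to confirm which patterns apply to a given library version.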

Fine-tune the model for one epoch, then evaluate it on the validation set:

In [16]:
train_loss_set = []   # per-batch losses, kept for later inspection
train_loss = 0.0

model.train()

for step, batch in enumerate(train_dataloader):
    # Move the batch to the selected device and unpack it
    batch = tuple(t.to(device) for t in batch)
    b_input_ids, b_input_mask, b_labels = batch

    optimizer.zero_grad()

    # With labels supplied, the first element of the output tuple is the loss
    outputs = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask, labels=b_labels)
    loss = outputs[0]
    train_loss_set.append(loss.item())

    loss.backward()
    optimizer.step()

    train_loss += loss.item()

print(f"Training loss: {train_loss / len(train_dataloader)}")

model.eval()

valid_preds, valid_labels = [], []

for batch in validation_dataloader:
    batch = tuple(t.to(device) for t in batch)
    b_input_ids, b_input_mask, b_labels = batch

    # Without labels, the first element of the output tuple holds the logits
    with torch.no_grad():
        outputs = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask)

    logits = outputs[0].detach().cpu().numpy()
    label_ids = b_labels.to('cpu').numpy()

    batch_preds = np.argmax(logits, axis=1)
    batch_labels = np.concatenate(label_ids)  # flatten labels from shape (batch, 1) to (batch,)
    valid_preds.extend(batch_preds)
    valid_labels.extend(batch_labels)

print(f"Validation accuracy: {accuracy_score(valid_labels, valid_preds) * 100}%")

	add_(Number alpha, Tensor other)
Consider using one of the following signatures instead:
	add_(Tensor other, *, Number alpha) (Triggered internally at  /usr/local/src/pytorch/torch/csrc/utils/python_arg_parser.cpp:1055.)
  exp_avg.mul_(beta1).add_(1.0 - beta1, grad)


Training loss: 0.32570635564814365
Validation accuracy: 90.14285714285715%
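
The validation loop turns each pair of logits into a class by taking the index of the larger value, then compares the result with the labels. A small pure-Python sketch of that step, using made-up logits and the hypothetical helpers `argmax_rows` and `accuracy`:

```python
def argmax_rows(logits):
    """Index of the largest value in each row (what np.argmax(logits, axis=1) computes)."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


def accuracy(y_true, y_pred):
    """Share of positions where prediction and label agree."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Made-up logits for four reviews: column 0 = negative, column 1 = positive.
toy_logits = [[2.1, -1.3], [-0.4, 0.9], [0.2, 0.1], [-1.5, 3.0]]
toy_labels = [[0], [1], [1], [1]]   # nested like the labels in this notebook

preds = argmax_rows(toy_logits)
flat_labels = [label[0] for label in toy_labels]  # flatten (N, 1) labels to (N,)

print(preds)                         # [0, 1, 0, 1]
print(accuracy(flat_labels, preds))  # 0.75
```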


## Evaluating the Model

Evaluate the model on the held-out test set, applying the same preprocessing as for the training data:

In [17]:
tokenized_texts = [tokenizer.tokenize(sent) for sent in test_sentences]
input_ids = [tokenizer.convert_tokens_to_ids(x) for x in tokenized_texts]

input_ids = pad_sequences(
    input_ids,
    maxlen=150,
    truncating="post",
    padding="post"
)

attention_masks = [[float(i>0) for i in seq] for seq in input_ids]

In [18]:
prediction_inputs = torch.tensor(input_ids)
prediction_masks = torch.tensor(attention_masks)
prediction_labels = torch.tensor(test_gt)

prediction_data = TensorDataset(
    prediction_inputs,
    prediction_masks,
    prediction_labels
)

prediction_dataloader = DataLoader(
    prediction_data, 
    sampler=SequentialSampler(prediction_data),
    batch_size=32
)

In [19]:
model.eval()
test_preds, test_labels = [], []

for batch in prediction_dataloader:
    batch = tuple(t.to(device) for t in batch)
    
    b_input_ids, b_input_mask, b_labels = batch
    
    with torch.no_grad():
        logits = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask)

    logits = logits[0].detach().cpu().numpy()
    label_ids = b_labels.to('cpu').numpy()

    batch_preds = np.argmax(logits, axis=1)
    batch_labels = np.concatenate(label_ids)  
    test_preds.extend(batch_preds)
    test_labels.extend(batch_labels)

In [20]:
acc_score = accuracy_score(test_labels, test_preds)
print(f'Test accuracy: {acc_score*100}%')

Test accuracy: 89.55333333333333%
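
To put the gap between the validation and test figures in context, the sampling uncertainty of an accuracy estimate can be approximated with the binomial standard error `sqrt(p * (1 - p) / n)`. This is a back-of-the-envelope check added for illustration; it assumes the test reviews behave like independent draws.

```python
import math

accuracy = 0.8955333333333333   # test accuracy reported above
n_test = 15000                  # size of the held-out set

# Binomial standard error of a proportion estimated from n_test samples
std_error = math.sqrt(accuracy * (1 - accuracy) / n_test)

# Normal-approximation interval at roughly the 95% level
low = accuracy - 1.96 * std_error
high = accuracy + 1.96 * std_error

print(f"standard error: {std_error * 100:.2f} percentage points")
print(f"approximate 95% interval: {low * 100:.1f}% to {high * 100:.1f}%")
```

Under these assumptions the standard error is about a quarter of a percentage point on the test set and larger on the 3,500-review validation set, so the difference of roughly 0.6 points between the two scores is plausibly within sampling noise.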
