Данные библиотеки нужно установить для работы ноутбука. Для запуске в google colab скорей всего потребуется установить только последнюю.

In [None]:
# !pip install -U torch
# !pip install torchtext

# !pip install transformers==3.0.0

In [4]:
import numpy as np
from sklearn.metrics import accuracy_score

import warnings
warnings.filterwarnings("ignore")

import torch
import torch.nn as nn
import torch.nn.functional as F

import transformers
from transformers import BertTokenizer, BertConfig
from transformers import AdamW, BertModel

from keras.preprocessing.sequence import pad_sequences

I0119 19:07:05.948047 4615400896 file_utils.py:39] PyTorch version 1.7.1 available.
Using TensorFlow backend.


In [5]:
# Important!
# later versions have a bug in Bert model
print(transformers.__version__) # should be 3.0.0!

3.0.0


# Prediction

1. Создаем класс модели, которую будем инизиализировать весами, полученными при обучении.
2. Создаем класс Predictor, реализующий функционал предсказания.

In [23]:
MAX_LEN = 52
padding = lambda texts: pad_sequences(texts, maxlen=MAX_LEN, dtype="long", truncating="post", padding="post")
get_attention_masks = lambda input_ids: [[float(i > 0) for i in seq] for seq in input_ids]

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

class BertForSentimentAnalysis(nn.Module):
    def __init__(self):
        super(BertForSentimentAnalysis, self).__init__()
        self.bert = BertModel.from_pretrained('bert-base-uncased')
        self.dropout = nn.Dropout(0.1)
        self.dense = nn.Linear(self.bert.config.hidden_size, 1)
        
    def forward(self, inds, mask):
        _, x = self.bert(inds, mask)
        x = self.dropout(x)
        x = self.dense(x)
        return x


class Predictor:
    def __init__(self, path_to_model='model.pt'):
        self.tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
        self.bert_model = BertForSentimentAnalysis().to(device)
        self.bert_model.load_state_dict(torch.load(path_to_model))
        
    def predict(self, text):
        '''Returns a prediction for text'''
        # tokenize text (split to tokens) using Bert tokenizer, add special tokens [CLS], [SEP]
        tokenized_text = self.tokenizer.tokenize("[CLS] " + text + " [SEP]") 
        # convert tokens to indeces
        inds = self.tokenizer.convert_tokens_to_ids(tokenized_text)
        # convert text to unified length (MAX_LEN).
        # truncate long texts to MAX_LEN, pad (add special symbol [PAD]) short texts to MAX_LEN
        inds = padding([inds])
        #calculate mask: 0 for [PAD] token, 1 - otherwise
        attention_mask = get_attention_masks(inds)

        #convert vectors to torch tensors
        X = torch.tensor(inds).to(device)
        mask = torch.tensor(attention_mask).to(device)

        #calculate logits from the model
        logits = self.bert_model(X, mask).squeeze(1)
        #calculate predictions by logits
        prediction = int(torch.round(torch.sigmoid(logits)).detach().cpu().numpy()[0])
        return prediction
    
    def predict_many(self, texts):
        '''Returns a vector of predictions for texts'''
        predictions = [self.predict(text) for text in texts]
        return predictions
    
    def evaluate(self, texts, targets):
        '''Returns accuracy and a vector of predictions for texts'''
        predictions = self.predict_many(texts)
        return np.mean(np.array(predictions) == np.array(targets)), predictions

Создаем объект класса Predictor, это потребует немного времени(несколько секунд), веса должны загрузиться в память и проинициализироваться.

In [24]:
predictor = Predictor()

I0119 20:49:34.044057 4615400896 tokenization_utils_base.py:1254] loading file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt from cache at /Users/elkir/.cache/torch/transformers/26bc1ad6c0ac742e9b52263248f6d0f00068293b33709fae12320c0e35ccfbbb.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084
I0119 20:49:34.762545 4615400896 configuration_utils.py:264] loading configuration file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-config.json from cache at /Users/elkir/.cache/torch/transformers/4dad0251492946e18ac39290fcfe91b89d370fee250efe9521476438fe8ca185.7156163d5fdc189c3016baca0775ffce230789d7fa2a42ef516483e4ca884517
I0119 20:49:34.766453 4615400896 configuration_utils.py:300] Model config BertConfig {
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range

RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.

Делаем предсказание.

In [25]:
text = "That is the best thing happened  I'll tell you tomorrow when I see you."
predictor.predict(text)

1

Делаем предсказание для набора текстов.

In [26]:
texts = [
    'You are the bad person',
    "I'm tired of this situation",
    'the weather today kind of sucks',
    'wow this is great',
    'i love you',
    'the weather is kinda cool'
]
predictor.predict_many(texts)

[0, 0, 0, 1, 1, 1]

Считаем метрику качества для набора текстов. Для этого загружаем тестовые объекты из датасета Sentiment140

In [28]:
df_test = pd.read_csv("testdata.manual.2009.06.14.csv", encoding = "ISO-8859-1", header=None)
data_test = np.array(df_test[5])
target_test = np.array(df_test[0])

#remove test object with neutral setniment
data_test = data_test[target_test != 2]
target_test = target_test[target_test != 2]
# change target value of 4 to 1
target_test[target_test == 4] = 1

accuracy, predictions = predictor.evaluate(data_test, target_test)
print("Accuracy: {0:.4f}%".format(accuracy * 100))

Accuracy: 86.0724%
