# Seminar 4: Word Representations, Continued

In [None]:
%%writefile requirements.txt
gensim
pandas
razdel
allennlp
pytorch_lightning

Overwriting requirements.txt


In [None]:
!pip install --upgrade -r requirements.txt

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


# Torch

One of the best-known and most convenient frameworks for training neural networks. It does not require compiling models; everything is executed on the fly.
It is built on Autograd, an automatic differentiation system. In essence, Torch = numpy + Autograd + a set of ready-made neural network modules.


*Parts of this section are taken from https://github.com/DanAnastasyev/DeepNLP-Course*

### Computational graphs

Computational graphs are a convenient way to compute gradients of complex functions quickly.

For example, the function

$$f = (x + y) \cdot z$$

is represented by the graph

![graph](https://image.ibb.co/mWM0Lx/1_6o_Utr7_ENFHOK7_J4l_XJtw1g.png)  
*From [Backpropagation, Intuitions - CS231n](http://cs231n.github.io/optimization-2/)*

Let us fix the values of $x, y, z$ (shown in green in the figure). How do we compute $\frac{\partial f}{\partial x}, \frac{\partial f}{\partial y}, \frac{\partial f}{\partial z}$? (*Recall what backpropagation is.*)
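As a reminder of what backpropagation does on this graph, here is a minimal hand-written sketch in plain Python. The values $x = -2$, $y = 5$, $z = -4$ are the ones from the figure; the variable names (`df_dq` and so on) are only illustrative.

```python
# Forward pass through the graph f = (x + y) * z
x, y, z = -2.0, 5.0, -4.0
q = x + y            # intermediate node
f = q * z

# Backward pass: apply the chain rule node by node, starting from df/df = 1
df_df = 1.0
df_dq = z * df_df    # d(q * z)/dq = z
df_dz = q * df_df    # d(q * z)/dz = q
df_dx = df_dq * 1.0  # dq/dx = 1
df_dy = df_dq * 1.0  # dq/dy = 1

print(f, df_dx, df_dy, df_dz)
```

Each local derivative is multiplied by the gradient arriving from the node above it, which is exactly the chain rule $\frac{\partial f}{\partial x} = \frac{\partial f}{\partial q} \cdot \frac{\partial q}{\partial x}$.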

In PyTorch such computations are very simple.

First we define the function, which is just a sequence of operations:

In [None]:
import torch

x = torch.tensor(-2., requires_grad=True)
y = torch.tensor(5., requires_grad=True)
z = torch.tensor(-4., requires_grad=True)

q = x + y
f = q * z

In [None]:
# df/dx = df/dq * dq/dx

And then we tell it: "Compute the gradients, please."

In [None]:
f.backward()

print('df/dz =', z.grad)
print('df/dx =', x.grad)
print('df/dy =', y.grad)

df/dz = tensor(3.)
df/dx = tensor(-4.)
df/dy = tensor(-4.)


More details on how autograd works can be found here: [Autograd mechanics](https://pytorch.org/docs/stable/notes/autograd.html).

In general, any tensor in PyTorch is an analogue of a multidimensional array in numpy.

It holds the data:

In [None]:
x.data

tensor(-2.)

The accumulated gradient:

In [None]:
x.grad

tensor(-4.)
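The word "accumulated" matters: repeated `backward()` calls add to `.grad` rather than overwrite it. A small sketch of this behaviour, using a made-up tensor `w` that is not part of the seminar code:

```python
import torch

w = torch.tensor(3.0, requires_grad=True)

(w * 2).backward()
first = w.grad.item()    # d(2w)/dw = 2

(w * 2).backward()
second = w.grad.item()   # the new gradient is added to the stored one

w.grad.zero_()           # reset by hand; optimizers expose zero_grad() for this
third = w.grad.item()

print(first, second, third)
```

This is why a training loop has to zero the gradients between optimization steps.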

The function that computes the gradient:

In [None]:
q.grad_fn

<AddBackward0 at 0x7f8bc4f5b520>

And various additional metadata:

In [None]:
x.type(), x.shape, x.device, x.layout

('torch.FloatTensor', torch.Size([]), device(type='cpu'), torch.strided)

# Our own Word2Vec

And now the promised hand-written Word2Vec. We implement it in Torch, although plain numpy would also be enough here (with somewhat more effort).

### Preparation

Download everything from the previous seminar again...

In [None]:
!wget https://github.com/yutkin/Lenta.Ru-News-Dataset/releases/download/v1.0/lenta-ru-news.csv.gz
!gzip -d lenta-ru-news.csv.gz
!head -n 2 lenta-ru-news.csv


In [None]:
import pandas as pd
import re
import datetime as dt
from razdel import tokenize, sentenize
from string import punctuation

def get_date(url):
    dates = re.findall(r"\d\d\d\d\/\d\d\/\d\d", url)
    return next(iter(dates), None)

dataset = pd.read_csv("lenta-ru-news.csv", sep=',', quotechar='\"', escapechar='\\', encoding='utf-8', header=0)
dataset["date"] = dataset["url"].apply(lambda x: dt.datetime.strptime(get_date(x), "%Y/%m/%d"))
dataset = dataset[dataset["date"] > "2017-01-01"].copy()  # copy to avoid SettingWithCopyWarning on the assignments below
dataset["text"] = dataset["text"].apply(lambda x: x.replace("\xa0", " "))
dataset["title"] = dataset["title"].apply(lambda x: x.replace("\xa0", " "))
train_dataset = dataset[dataset["date"] < "2018-04-01"]
test_dataset = dataset[dataset["date"] > "2018-04-01"]

def get_texts(dataset):
    texts = []
    for text in dataset["text"]:
        for sentence in sentenize(text):
            texts.append([token.text.lower() for token in tokenize(sentence.text) if token.text not in punctuation])
    
    for title in dataset["title"]:
        texts.append([token.text.lower() for token in tokenize(title) if token.text not in punctuation])
    return texts

texts = get_texts(train_dataset)
test_texts = get_texts(test_dataset)

assert len(texts) == 827217
assert len(texts[0]) > 0
assert texts[0][0].islower()
print(texts[0])



['возобновление', 'нормального', 'сотрудничества', 'между', 'россией', 'и', 'нато', 'невозможно', 'пока', 'москва', 'не', 'будет', 'соблюдать', 'нормы', 'международного', 'права']


A reminder...

![embeddings training](https://miro.medium.com/max/1400/0*o2FCVrLKtdcxPQqc.png)
*From [An implementation guide to Word2Vec using NumPy and Google Sheets](https://towardsdatascience.com/an-implementation-guide-to-word2vec-using-numpy-and-google-sheets-13445eebd281)*

Papers:
* Word2Vec: [Distributed Representations of Words and Phrases and their Compositionality](https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf), Mikolov et al., 2013
* GloVe: [GloVe: Global Vectors for Word Representation](https://nlp.stanford.edu/pubs/glove.pdf), Pennington, Socher, Manning, 2014
* fastText: [Enriching Word Vectors with Subword Information](https://arxiv.org/pdf/1607.04606.pdf), Bojanowski, Grave, Joulin, Mikolov, 2016

We will build the skip-gram model ourselves.

## Preprocessing and batching

Previously gensim built the vocabulary for us implicitly. Now we have to do it ourselves.

In [None]:
from collections import Counter

class Vocabulary:
    def __init__(self):
        self.word2index = {
            "<unk>": 0
        }
        self.index2word = ["<unk>"]

    def build(self, texts, min_count=10):
        words_counter = Counter(token for tokens in texts for token in tokens)
        for word, count in words_counter.most_common():
            if count >= min_count:
                self.word2index[word] = len(self.word2index)  # the most frequent words get the smallest indices
        # Sort word2index by index: the result is the list of words in order of decreasing frequency
        self.index2word = [word for word, _ in sorted(self.word2index.items(), key=lambda x: x[1])]
    
    @property
    def size(self):
        return len(self.index2word)
    
    def top(self, n=100):
        return self.index2word[1:n+1]
    
    def get_index(self, word):
        return self.word2index.get(word, 0)
    
    def get_word(self, index):
        return self.index2word[index]

vocabulary = Vocabulary()
vocabulary.build(texts)
assert vocabulary.word2index[vocabulary.index2word[10]] == 10
print(vocabulary.size)
print(vocabulary.top(100))

71186
['в', 'и', 'на', '«', '»', 'что', 'с', 'по', '—', 'не', 'из', 'этом', 'об', 'о', 'он', 'за', 'года', 'россии', 'к', 'его', 'для', 'как', 'также', 'от', 'а', 'это', 'сообщает', 'до', 'году', 'после', 'сша', 'у', 'во', 'время', 'был', 'при', 'заявил', 'со', 'словам', 'рублей', 'будет', 'ее', 'она', 'но', 'ранее', 'их', 'они', 'было', 'тысяч', 'более', 'того', 'том', 'мы', 'были', 'я', 'которые', 'все', 'который', 'человек', 'под', '2016', 'из-за', 'лет', '2017', 'украины', 'марта', 'процентов', 'чтобы', 'долларов', 'глава', 'президент', 'этого', 'отметил', 'же', 'сказал', 'так', 'января', 'или', 'страны', 'ру', 'то', 'еще', 'области', 'данным', 'была', 'президента', 'около', 'сообщил', 'февраля', 'однако', 'компании', 'может', 'уже', 'один', 'рассказал', 'только', 'процента', '1', '10', 'июня']


Collect all central words together with their contexts and convert them to vocabulary indices.

In [None]:
def build_contexts(tokenized_texts, vocabulary, window_size):
    contexts = []
    for tokens in tokenized_texts:
        for i in range(len(tokens)):
            central_word = vocabulary.get_index(tokens[i])
            context = [vocabulary.get_index(tokens[i + delta]) for delta in range(-window_size, window_size + 1) 
                       if delta != 0 and i + delta >= 0 and i + delta < len(tokens)]
            if len(context) != 2 * window_size:
                continue

            contexts.append((central_word, context))  # reached only when the window is complete (not skipped above)
            
    return contexts

contexts = build_contexts(texts, vocabulary, window_size=2)
print(contexts[:5])
print(vocabulary.get_word(contexts[0][0]), [vocabulary.get_word(index) for index in contexts[0][1]])

[(1568, [17232, 26343, 135, 371]), (135, [26343, 1568, 371, 2]), (371, [1568, 135, 2, 695]), (2, [135, 371, 695, 2140]), (695, [371, 2, 2140, 216])]
сотрудничества ['возобновление', 'нормального', 'между', 'россией']
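Note that `build_contexts` keeps only positions with a complete window, so words near sentence boundaries never become central words. A toy sketch of the same windowing logic on raw tokens, without the vocabulary lookup (`toy_window_pairs` is a hypothetical helper, not part of the seminar code):

```python
def toy_window_pairs(tokens, window_size):
    """Return (central, context) pairs, skipping positions with an incomplete window."""
    pairs = []
    for i in range(len(tokens)):
        context = [tokens[i + d] for d in range(-window_size, window_size + 1)
                   if d != 0 and 0 <= i + d < len(tokens)]
        if len(context) == 2 * window_size:
            pairs.append((tokens[i], context))
    return pairs

pairs = toy_window_pairs(["a", "b", "c", "d", "e", "f"], window_size=2)
for central, context in pairs:
    print(central, context)
```

Out of six tokens only `c` and `d` have two neighbours on each side, so only they produce training examples.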


Build a batch generator to speed up processing.

In [None]:
import random
import numpy as np
import torch

def get_next_batch(contexts, window_size, batch_size, epochs_count):
    assert batch_size % (window_size * 2) == 0  # a batch must hold whole contexts together with their central words
    central_words, contexts = zip(*contexts)
    batch_size //= (window_size * 2)  # integer division: number of central words per batch
    
    for epoch in range(epochs_count):
        print(f'epoch #{epoch}'.center(50, '-'))
        indices = np.arange(len(contexts))
        np.random.shuffle(indices)
        batch_begin = 0
        while batch_begin < len(contexts):  # loop until all contexts (central words) are consumed
            batch_indices = indices[batch_begin: batch_begin + batch_size]
            batch_contexts, batch_centrals = [], []
            for data_ind in batch_indices:
                central_word, context = central_words[data_ind], contexts[data_ind]
                batch_contexts.extend(context)
                batch_centrals.extend([central_word] * len(context))  # repeat the central word once per context word
                
            batch_begin += batch_size
            yield torch.cuda.LongTensor(batch_contexts), torch.cuda.LongTensor(batch_centrals)

print(next(get_next_batch(contexts, window_size=2, batch_size=64, epochs_count=10)))

---------------------epoch #0---------------------
(tensor([    9,    28,    49,   445,     1,   885,  1548,   404,     6,  8549,
            0,    27,  1517,   177,  2475, 30439,    19,   727,   270, 35103,
          146,   340, 22187,  8344, 32373, 12735,  2743,   162,    32,  6772,
          618,     7,     0,     0,  1609,  8406,   215,     0,     4, 61153,
         1197,     1,   795,   455, 25342,  7827,    19,    18,  6106,  3672,
           75,    70,   465,   374,     0,    21, 39100,  1824,     2,  2560,
        10452,     9,     2,   231], device='cuda:0'), tensor([  573,   573,   573,   573,  1286,  1286,  1286,  1286,   469,   469,
          469,   469, 18823, 18823, 18823, 18823,   387,   387,   387,   387,
           35,    35,    35,    35,     4,     4,     4,     4,  5804,  5804,
         5804,  5804,   105,   105,   105,   105,  5122,  5122,  5122,  5122,
          250,   250,   250,   250,  7376,  7376,  7376,  7376,   917,   917,
          917,   917,    14,    14,

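## Model and training

A toy example: the vocabulary has 5 words, and we build embeddings of dimension 4.

The training pair is $(3, 1)$: the input is the word with index 3, and the context word to predict has index 1.

The embedding matrix $W$ has one row per word:

$$W = \begin{pmatrix} 1 & 2 & 3 & 4 \\ 5 & 6 & 7 & 8 \\ 9 & 10 & 11 & 12 \\ 13 & 14 & 15 & 16 \\ 17 & 18 & 19 & 20 \end{pmatrix}$$

The embedding lookup selects the row of the input word (here the third row, counting from one):

$$V_3 = (9, 10, 11, 12)$$

The output matrix $U$ maps the embedding back to the vocabulary size:

$$U = \begin{pmatrix} 20 & 19 & 18 & 17 & 16 \\ 15 & 14 & 13 & 12 & 11 \\ 10 & 9 & 8 & 7 & 6 \\ 5 & 4 & 3 & 2 & 1 \end{pmatrix}$$

Multiplying $V_3$ by $U$ gives one score per vocabulary word (the numbers below are round placeholders for illustration, not the actual product):

$$V_3' = (100, 200, 300, 400, 500)$$

Softmax turns the scores into probabilities (again, illustrative values):

$$p_i = \frac{e^{x_i}}{\sum_j e^{x_j}}, \qquad V_3'' = (0.1, 0.15, 0.2, 0.25, 0.3)$$

The target is the one-hot vector of the context word (position 1, counting from zero):

$$U_1 = (0, 1, 0, 0, 0)$$

The cross-entropy loss keeps only the probability of the true context word:

$$\mathrm{CELoss} = -\sum_i y_i \log p_i = -\log 0.15$$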
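The toy example above can be run in numpy to see the real numbers. This is only a sketch of the forward pass: the matrices are the ones from the notes, the selected row is the one written out there, and the resulting scores differ from the round illustrative values.

```python
import numpy as np

W = np.arange(1, 21, dtype=np.float64).reshape(5, 4)      # embedding matrix, 5 words x 4 dims
U = np.arange(20, 0, -1, dtype=np.float64).reshape(4, 5)  # output matrix, 4 dims x 5 words

v = W[2]                 # the row (9, 10, 11, 12) used in the notes
logits = v @ U           # one score per vocabulary word

# Numerically stable softmax: subtract the maximum before exponentiating
shifted = logits - logits.max()
probs = np.exp(shifted) / np.exp(shifted).sum()

target = 1               # position of the context word in the one-hot vector
loss = -np.log(probs[target])

print(logits)
print(loss)
```

`nn.CrossEntropyLoss` performs the softmax and the negative log-likelihood in one call, which is why the model below returns raw scores rather than probabilities.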



In [None]:
import torch.nn as nn
import torch.optim as optim 
import time

class SkipGramModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim=32):
        super().__init__()
        
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.out_layer = nn.Linear(embedding_dim, vocab_size)

    def forward(self, inputs):
        projections = self.embeddings(inputs)
        output = self.out_layer(projections)
        return output
      

model = SkipGramModel(vocabulary.size, 32)

device = torch.device("cuda")
model = model.to(device)

loss_every_nsteps = 10000
total_loss = 0
start_time = time.time()
optimizer = optim.Adam(model.parameters(), lr=0.01)
loss_function = nn.CrossEntropyLoss().cuda()

for step, (batch_contexts, batch_centrals) in enumerate(get_next_batch(contexts, window_size=2, batch_size=256, epochs_count=10)):
    logits = model(batch_centrals) # Forward pass
    loss = loss_function(logits, batch_contexts) # Compute the loss
    loss.backward() # Compute the gradients dL/dw
    optimizer.step() # Gradient descent step or one of its variants (Adam here)
    optimizer.zero_grad() # Zero the gradients so they do not carry over into the next iteration

    total_loss += loss.item()
    if step != 0 and step % loss_every_nsteps == 0:
        print("Step = {}, Avg Loss = {:.4f}, Time = {:.2f}s".format(step, total_loss / loss_every_nsteps, time.time() - start_time))
        total_loss = 0
        start_time = time.time()

---------------------epoch #0---------------------
Step = 10000, Avg Loss = 8.2186, Time = 84.21s
Step = 20000, Avg Loss = 7.9063, Time = 75.48s
Step = 30000, Avg Loss = 7.8114, Time = 75.39s
Step = 40000, Avg Loss = 7.7658, Time = 75.42s
Step = 50000, Avg Loss = 7.7294, Time = 75.46s
Step = 60000, Avg Loss = 7.7105, Time = 75.46s
Step = 70000, Avg Loss = 7.6923, Time = 75.45s
Step = 80000, Avg Loss = 7.6812, Time = 75.38s
Step = 90000, Avg Loss = 7.6741, Time = 75.40s
Step = 100000, Avg Loss = 7.6633, Time = 75.38s
Step = 110000, Avg Loss = 7.6639, Time = 75.43s
Step = 120000, Avg Loss = 7.6576, Time = 75.40s
Step = 130000, Avg Loss = 7.6538, Time = 75.40s
Step = 140000, Avg Loss = 7.6496, Time = 75.40s
---------------------epoch #1---------------------
Step = 150000, Avg Loss = 7.6274, Time = 76.05s
Step = 160000, Avg Loss = 7.6231, Time = 75.36s
Step = 170000, Avg Loss = 7.6353, Time = 75.37s
Step = 180000, Avg Loss = 7.6332, Time = 75.35s
Step = 190000, Avg Loss = 7.6364, Time = 75

Now let us access the weights. To do so, we move them out of GPU memory and convert the tensor to a numpy matrix.

In [None]:
embeddings = model.embeddings.weight.cpu().data.numpy()

## PyTorch Lightning
A wrapper around Torch that abstracts away the training loop.

In [None]:
import torch
import json
import random
from itertools import cycle
from torch.utils.data import Dataset, IterableDataset

def get_samples(tokenized_texts, window_size, texts_count):
    for text_num, tokens in enumerate(tokenized_texts):
        if texts_count and text_num >= texts_count:
            break
        for i in range(len(tokens)):
            central_word = vocabulary.get_index(tokens[i])
            for delta in range(-window_size, window_size + 1):
                if delta == 0:
                    continue
                if 0 <= (i + delta) < len(tokens):
                    context_word = vocabulary.get_index(tokens[i + delta])
                    yield (torch.cuda.LongTensor([central_word]),
                           torch.cuda.LongTensor([context_word]))


def get_samples_cycle(tokenized_texts, window_size, texts_count):
    while True:
        for sample in get_samples(tokenized_texts, window_size, texts_count):
            yield sample


class Word2VecDataset(Dataset):
    def __init__(self, tokenized_texts, vocabulary, window_size=2, texts_count=100000):
        self.samples = list(get_samples(tokenized_texts, window_size, texts_count))
    
    def __len__(self):
        return len(self.samples)
    
    def __getitem__(self, index):
        return self.samples[index]


class Word2VecIterableDataset(IterableDataset):
    def __init__(self, tokenized_texts, vocabulary, window_size=2, texts_count=None):
        self.tokenized_texts = tokenized_texts
        self.vocabulary = vocabulary
        self.window_size = window_size
        self.texts_count = texts_count

    def __iter__(self):
        return get_samples_cycle(self.tokenized_texts, self.window_size, self.texts_count)

In [None]:
from torch.utils.data import DataLoader, RandomSampler

BATCH_SIZE = 256

random.shuffle(texts)
train_data = Word2VecIterableDataset(texts, vocabulary)
train_loader = DataLoader(train_data, batch_size=BATCH_SIZE)

random.shuffle(test_texts)
val_data = Word2VecIterableDataset(test_texts, vocabulary)
val_loader = DataLoader(val_data, batch_size=BATCH_SIZE)

In [None]:
import torch
import torch.nn as nn
from pytorch_lightning import LightningModule

class SkipGramModel(LightningModule):
    def __init__(self, vocab_size, embedding_dim=128):
        super().__init__()
        
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.out_layer = nn.Linear(embedding_dim, vocab_size)
        self.loss = nn.CrossEntropyLoss()
        self.validation_step_outputs = []

    
    def forward(self, centrals, contexts):
        projections = self.embeddings.forward(centrals)
        logits = self.out_layer.forward(projections)
        logits = logits.transpose(1, 2)
        loss = self.loss(logits, contexts)
        return loss
    
    def training_step(self, batch, batch_nb):
        return {'loss': self(*batch)}
    
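    def validation_step(self, batch, batch_nb):
        loss = self(*batch)  # previously `loss` was undefined here and silently picked up a stale global
        self.validation_step_outputs.append(loss)
        # Log the metric so that EarlyStopping(monitor="val_loss") can find it
        self.log("val_loss", loss, prog_bar=True)
        return {'val_loss': loss}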

    def test_step(self, batch, batch_nb):
        return {'test_loss': self(*batch)}
    
    def on_validation_epoch_end(self):
        avg_loss = torch.stack(self.validation_step_outputs).mean()
        self.log("avg_val_loss", avg_loss, prog_bar=True)
        self.validation_step_outputs.clear()  # empty the buffer before the next validation run

    def test_epoch_end(self, outputs):
        avg_loss = torch.stack([x['test_loss'] for x in outputs]).mean()
        tensorboard_logs = {'test_loss': avg_loss}
        return {'test_loss': avg_loss, 'progress_bar': tensorboard_logs}
    
    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=1e-2)
        return [optimizer]

In [None]:
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import EarlyStopping

EPOCHS = 1

model = SkipGramModel(vocabulary.size)
early_stop_callback = EarlyStopping(
    monitor="val_loss",
    min_delta=0.0,
    patience=5,
    verbose=True,
    mode="min" 
)
trainer = Trainer(
    devices=1, #gpus 
    #checkpoint_callback=False,
    max_epochs=EPOCHS,
    callbacks=[early_stop_callback],
    #progress_bar_refresh_rate=100,
    limit_train_batches=40000,
    limit_val_batches=500,
    val_check_interval=2000)
trainer.fit(model, train_loader, val_loader)

INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:IPU available: False, using: 0 IPUs
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO:pytorch_lightning.callbacks.model_summary:
  | Name       | Type             | Params
------------------------------------------------
0 | embeddings | Embedding        | 9.1 M 
1 | out_layer  | Linear           | 9.2 M 
2 | loss       | CrossEntropyLoss | 0     
------------------------------------------------
18.3 M    Trainable params
0         Non-trainable params
18.3 M    Total params
73.179    Total estimated model params size (MB)


RuntimeError: ignored

In [None]:
model.freeze()

In [None]:
embeddings = model.embeddings.weight.cpu().data.numpy()

In [None]:
import numpy as np
np.save("embeddings.npy", embeddings)

## Basic checks

In [None]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def most_similar(embeddings, vocabulary, word):
    word_emb = embeddings[vocabulary.get_index(word)]
    
    similarities = cosine_similarity([word_emb], embeddings)[0]
    top10 = np.argsort(similarities)[-10:]
    
    return [vocabulary.get_word(index) for index in reversed(top10)]

most_similar(embeddings, vocabulary, 'путин')

['путин',
 'жириновский',
 'семашко',
 'мединский',
 'сафронов',
 'познер',
 'гройсман',
 'владимир',
 'кистион',
 'президент']

Let us make the same visualization as in the previous seminar.

In [None]:
import bokeh.models as bm, bokeh.plotting as pl
from bokeh.io import output_notebook

from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale


def draw_vectors(x, y, radius=10, alpha=0.25, color='blue',
                 width=600, height=400, show=True, **kwargs):
    """ draws an interactive plot for data points with auxiliary info on hover """
    output_notebook()
    
    if isinstance(color, str): 
        color = [color] * len(x)
    data_source = bm.ColumnDataSource({ 'x' : x, 'y' : y, 'color': color, **kwargs })

    fig = pl.figure(active_scroll='wheel_zoom', width=width, height=height)
    fig.scatter('x', 'y', size=radius, color='color', alpha=alpha, source=data_source)

    fig.add_tools(bm.HoverTool(tooltips=[(key, "@" + key) for key in kwargs.keys()]))
    if show: 
        pl.show(fig)
    return fig


def get_tsne_projection(word_vectors):
    tsne = TSNE(n_components=2)
    return scale(tsne.fit_transform(word_vectors))

def get_pca_projection(word_vectors):
    pca = PCA(n_components=2)
    return scale(pca.fit_transform(word_vectors))
    
    
def visualize_embeddings(embeddings, vocabulary, word_count, method="pca"):
    word_vectors = embeddings[1: word_count + 1]
    words = vocabulary.top(word_count)
    get_projections = get_pca_projection if method == "pca" else get_tsne_projection
    projections = get_projections(word_vectors)
    draw_vectors(projections[:, 0], projections[:, 1], color='green', token=words)
    
    
visualize_embeddings(embeddings, vocabulary, 500, method="tsne")



## Topic classification task

In [None]:
def get_text_embedding(embeddings, vocabulary, phrase):
    embeddings = np.array([embeddings[vocabulary.get_index(word.text.lower())] for word in tokenize(phrase)])
    return np.mean(embeddings, axis=0)

target_labels = set(train_dataset["topic"].dropna().tolist())
target_labels -= {"69-я параллель", "Крым", "Культпросвет ", "Оружие", "Бизнес", "Путешествия"}
target_labels = list(target_labels)
print(target_labels)

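# Non-capturing group: \b now applies to every alternative and pandas no longer warns about match groups
pattern = r'\b(?:{})\b'.format('|'.join(map(re.escape, target_labels)))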

train_with_topics = train_dataset[train_dataset["topic"].str.contains(pattern, case=False, na=False)]
train_with_topics = train_with_topics.head(20000)

test_with_topics = test_dataset[test_dataset["topic"].str.contains(pattern, case=False, na=False)]

y_train = train_with_topics["topic"].apply(lambda x: target_labels.index(x)).to_numpy()
X_train = np.zeros((train_with_topics.shape[0], embeddings.shape[1]))
for i, embedding in enumerate(train_with_topics["text"]):
    X_train[i, :] = get_text_embedding(embeddings, vocabulary, embedding)

y_test = test_with_topics["topic"].apply(lambda x: target_labels.index(x)).to_numpy()
X_test = np.zeros((test_with_topics.shape[0], embeddings.shape[1]))
for i, embedding in enumerate(test_with_topics["text"]):
    X_test[i, :] = get_text_embedding(embeddings, vocabulary, embedding)

print(X_train.shape)
print(y_train)



['Силовые структуры', 'Россия', 'Из жизни', 'Интернет и СМИ', 'Экономика', 'Культура', 'Бывший СССР', 'Наука и техника', 'Спорт', 'Ценности', 'Мир', 'Дом']
(20000, 32)
[10 10  5 ...  3  4  9]


In [None]:
from sklearn.neural_network import MLPClassifier
from sklearn import metrics

clf = MLPClassifier()
clf.fit(X_train, y_train)

y_predicted = clf.predict(X_test)
print(metrics.classification_report(y_test, y_predicted))

              precision    recall  f1-score   support

           0       0.66      0.64      0.65      1663
           1       0.64      0.64      0.64      4324
           2       0.73      0.74      0.73      2191
           3       0.68      0.60      0.64      2447
           4       0.73      0.81      0.77      3185
           5       0.80      0.66      0.73      1995
           6       0.67      0.64      0.65      2156
           7       0.85      0.86      0.85      2119
           8       0.92      0.92      0.92      3429
           9       0.85      0.73      0.79      1177
          10       0.71      0.80      0.75      4291
          11       0.68      0.69      0.69      1182

    accuracy                           0.74     30159
   macro avg       0.74      0.73      0.73     30159
weighted avg       0.74      0.74      0.74     30159





## Benchmarks

### SimLex

The dataset contains word pairs with human similarity scores; we need to compute the Spearman correlation between those scores and the similarities given by our embeddings.

In [None]:
!wget https://rusvectores.org/static/testsets/ru_simlex965_tagged.tsv

--2023-03-26 21:32:35--  https://rusvectores.org/static/testsets/ru_simlex965_tagged.tsv
Resolving rusvectores.org (rusvectores.org)... 172.104.228.108
Connecting to rusvectores.org (rusvectores.org)|172.104.228.108|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 42924 (42K) [text/tab-separated-values]
Saving to: ‘ru_simlex965_tagged.tsv’


2023-03-26 21:32:36 (302 KB/s) - ‘ru_simlex965_tagged.tsv’ saved [42924/42924]



In [None]:
!head ru_simlex965_tagged.tsv

# Word1	Word2	Average Score
авария_NOUN	бедствие_NOUN	6.15
август_NOUN	месяц_NOUN	2.85
авиация_NOUN	полет_NOUN	6.77
автомобиль_NOUN	гудок_NOUN	1.85
автомобиль_NOUN	автострада_NOUN	1.23
автомобиль_NOUN	такси_NOUN	4.15
автомобиль_NOUN	датчик_NOUN	1.62
автомобиль_NOUN	велосипед_NOUN	1.38
автомобиль_NOUN	карета_NOUN	3
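The evaluation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the seminar's evaluation code: the vectors and the pair list are made-up toy data, `cosine`, `ranks` and `spearman` are hypothetical helpers, and the rank computation assumes no tied scores (a library routine such as `scipy.stats.spearmanr` handles ties).

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def ranks(values):
    """Rank positions (0 = smallest); assumes there are no ties."""
    order = np.argsort(values)
    result = np.empty(len(values))
    result[order] = np.arange(len(values))
    return result

def spearman(a, b):
    """Spearman correlation computed as the Pearson correlation of ranks."""
    return float(np.corrcoef(ranks(a), ranks(b))[0, 1])

# Hypothetical toy data standing in for the benchmark file and for trained embeddings
toy_vectors = {
    "car": np.array([1.0, 0.0]),
    "taxi": np.array([0.9, 0.1]),
    "bicycle": np.array([0.5, 0.5]),
    "month": np.array([0.0, 1.0]),
}
toy_pairs = [("car", "taxi", 4.15), ("car", "bicycle", 1.38), ("car", "month", 0.5)]

# Pairs containing out-of-vocabulary words are skipped
scored = [(gold, cosine(toy_vectors[w1], toy_vectors[w2]))
          for w1, w2, gold in toy_pairs
          if w1 in toy_vectors and w2 in toy_vectors]
gold_scores, model_scores = zip(*scored)
rho = spearman(np.array(gold_scores), np.array(model_scores))
print(rho)
```

With the real file, the `_NOUN` suffixes would have to be stripped first, and since the vocabulary built here is not lemmatized, some benchmark words may be missing from it.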


## Task 1: Hand-written CBoW

Build a similar model, but with the CBoW architecture.

### Model

In [None]:
import torch.nn as nn
import torch.optim as optim 
import time

class CBoWModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim=32):
        super().__init__()
        
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.out_layer = nn.Linear(embedding_dim, vocab_size)

    def forward(self, inputs):
        projections = self.embeddings(inputs)
        output = self.out_layer(projections)
        return output
      

model = CBoWModel(vocabulary.size, 32)

device = torch.device("cuda")
model = model.to(device)

loss_every_nsteps = 10000
total_loss = 0
start_time = time.time()
optimizer = optim.Adam(model.parameters(), lr=0.01)
loss_function = nn.CrossEntropyLoss().cuda()

for step, (batch_contexts, batch_centrals) in enumerate(get_next_batch(contexts, window_size=2, batch_size=256, epochs_count=1)):
    logits = model(batch_contexts) # Forward pass
    loss = loss_function(logits, batch_centrals) # Compute the loss
    loss.backward() # Compute the gradients dL/dw
    optimizer.step() # Gradient descent step or one of its variants (Adam here)
    optimizer.zero_grad() # Zero the gradients so they do not carry over into the next iteration

    total_loss += loss.item()
    if step != 0 and step % loss_every_nsteps == 0:
        print("Step = {}, Avg Loss = {:.4f}, Time = {:.2f}s".format(step, total_loss / loss_every_nsteps, time.time() - start_time))
        total_loss = 0
        start_time = time.time()

---------------------epoch #0---------------------


KeyboardInterrupt: ignored
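The model above predicts the central word from each context word separately. In the usual CBoW formulation the context embeddings are first combined (typically averaged) and the central word is predicted from the result. A minimal sketch of that variant, assuming inputs of shape `(batch, 2 * window_size)`; the class name `AveragingCBoW` and the toy sizes are made up for illustration:

```python
import torch
import torch.nn as nn

class AveragingCBoW(nn.Module):
    def __init__(self, vocab_size, embedding_dim=32):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.out_layer = nn.Linear(embedding_dim, vocab_size)

    def forward(self, contexts):
        # contexts: (batch, 2 * window_size) word indices
        projections = self.embeddings(contexts)   # (batch, 2 * window_size, dim)
        averaged = projections.mean(dim=1)        # (batch, dim)
        return self.out_layer(averaged)           # (batch, vocab_size)

toy_model = AveragingCBoW(vocab_size=10, embedding_dim=4)
ctx_batch = torch.tensor([[1, 2, 4, 5], [0, 3, 6, 7]])  # two windows of four context words
central_batch = torch.tensor([3, 5])                     # one central word per window

logits = toy_model(ctx_batch)
loss = nn.CrossEntropyLoss()(logits, central_batch)
loss.backward()
print(logits.shape, loss.item())
```

To train it on the seminar data, the batch generator would have to yield contexts as a two-dimensional tensor with one central word per row instead of the flattened pairs.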

### Testing

In [None]:
embeddings = model.embeddings.weight.cpu().data.numpy()

In [None]:
def get_text_embedding(embeddings, vocabulary, phrase):
    embeddings = np.array([embeddings[vocabulary.get_index(word.text.lower())] for word in tokenize(phrase)])
    return np.mean(embeddings, axis=0)

target_labels = set(train_dataset["topic"].dropna().tolist())
target_labels -= {"69-я параллель", "Крым", "Культпросвет ", "Оружие", "Бизнес", "Путешествия"}
target_labels = list(target_labels)
print(target_labels)

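# Non-capturing group: \b now applies to every alternative and pandas no longer warns about match groups
pattern = r'\b(?:{})\b'.format('|'.join(map(re.escape, target_labels)))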

train_with_topics = train_dataset[train_dataset["topic"].str.contains(pattern, case=False, na=False)]
train_with_topics = train_with_topics.head(20000)

test_with_topics = test_dataset[test_dataset["topic"].str.contains(pattern, case=False, na=False)]

y_train = train_with_topics["topic"].apply(lambda x: target_labels.index(x)).to_numpy()
X_train = np.zeros((train_with_topics.shape[0], embeddings.shape[1]))
for i, embedding in enumerate(train_with_topics["text"]):
    X_train[i, :] = get_text_embedding(embeddings, vocabulary, embedding)

y_test = test_with_topics["topic"].apply(lambda x: target_labels.index(x)).to_numpy()
X_test = np.zeros((test_with_topics.shape[0], embeddings.shape[1]))
for i, embedding in enumerate(test_with_topics["text"]):
    X_test[i, :] = get_text_embedding(embeddings, vocabulary, embedding)

print(X_train.shape)
print(y_train)

['Бывший СССР', 'Интернет и СМИ', 'Культура', 'Россия', 'Мир', 'Силовые структуры', 'Наука и техника', 'Экономика', 'Дом', 'Ценности', 'Из жизни', 'Спорт']




(20000, 32)
[4 4 2 ... 1 7 9]


In [None]:
from sklearn.neural_network import MLPClassifier
from sklearn import metrics

clf = MLPClassifier()
clf.fit(X_train, y_train)

y_predicted = clf.predict(X_test)
print(metrics.classification_report(y_test, y_predicted))

              precision    recall  f1-score   support

           0       0.46      0.36      0.40      2156
           1       0.32      0.29      0.30      2447
           2       0.43      0.40      0.42      1995
           3       0.40      0.41      0.41      4324
           4       0.49      0.60      0.54      4291
           5       0.52      0.45      0.48      1663
           6       0.43      0.48      0.46      2119
           7       0.44      0.48      0.46      3185
           8       0.48      0.38      0.42      1182
           9       0.46      0.45      0.45      1177
          10       0.53      0.50      0.51      2191
          11       0.65      0.62      0.63      3429

    accuracy                           0.47     30159
   macro avg       0.47      0.45      0.46     30159
weighted avg       0.47      0.47      0.47     30159





## Task 2: Negative Sampling

* 0) Targets: 1 for words from the context, 0 for random words drawn from the vocabulary according to the unigram distribution raised to the power alpha, alpha=0.75
* 1) Linear -> Embedding
* 2) Apply a second embedding layer to the context word
* 3) Dot product of emb1 and emb2 -> scalar (previously the output was a vector of vocabulary size)
* 4) CrossEntropyLoss -> BCELoss
* 5) Triplet loss: (pivot, positive, negative): pivot * positive - pivot * negative


Implement negative sampling instead of the full softmax.
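
For reference, the objective that steps 0-4 lead to can be written out for a single central word. The notation is an assumption introduced here for illustration: $v_w$ is the embedding of the central word, $u_c$ is the embedding of a true context word, $u_{n_1}, ..., u_{n_K}$ are the embeddings of $K$ sampled noise words, and $\sigma$ is the sigmoid.

$$L = -\log \sigma(u_c^\top v_w) - \sum_{k=1}^{K} \log \sigma(-u_{n_k}^\top v_w)$$

Since $1 - \sigma(x) = \sigma(-x)$, this is binary cross-entropy summed over one pair with target 1 and $K$ pairs with target 0, which is why CrossEntropyLoss is replaced by BCELoss.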

### Unigram distribution of words

In [None]:
# Unigram noise distribution: https://translated.turbopages.org/proxy_u/en-ru.ru.ba32465b-6421548d-39b0f577-74722d776562/https/stats.stackexchange.com/questions/605177/calculating-noise-distribution-in-skip-gram-negative-sampling
from collections import Counter

# Keep only the tokens that are present in the vocabulary
all_words = [token for tokens in texts for token in tokens if token in vocabulary.word2index]
words_count = Counter(all_words)
# Relative frequency of each word
unig_dist = {word: count / len(all_words) for word, count in words_count.items()}
# Raise to the power alpha to boost rare words, then renormalize so the values sum to 1
alpha = 0.75
noise_dist = {key: val ** alpha for key, val in unig_dist.items()}
Z = sum(noise_dist.values())
noise_dist_normalized = {key: val / Z for key, val in noise_dist.items()}

In [None]:
# convert the {word: prob} dictionary into {word_index: prob}
index2prob = {vocabulary.word2index[word_i]: noise_dist_normalized[word_i] for word_i in noise_dist_normalized.keys()}
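
To see what the exponent alpha does, here is a minimal sketch on a toy corpus. The token list and the helper name `noise_distribution` are hypothetical and are not part of the seminar data; the computation follows the same steps as the cell above (relative frequency, power alpha, renormalization).

```python
from collections import Counter

# Hypothetical toy corpus (not the seminar data)
toy_tokens = ["the"] * 90 + ["cat"] * 9 + ["zygote"] * 1

def noise_distribution(tokens, alpha=0.75):
    """Return {word: prob}, where prob is proportional to frequency ** alpha."""
    counts = Counter(tokens)
    total = sum(counts.values())
    weights = {word: (count / total) ** alpha for word, count in counts.items()}
    z = sum(weights.values())
    return {word: weight / z for word, weight in weights.items()}

unigram = noise_distribution(toy_tokens, alpha=1.0)    # plain relative frequencies
smoothed = noise_distribution(toy_tokens, alpha=0.75)  # flattened distribution

for word in unigram:
    print(f"{word:>7}: unigram={unigram[word]:.3f}  smoothed={smoothed[word]:.3f}")
```

With alpha below 1 the frequent word loses probability mass and the rare word gains it, so rare words show up as negatives more often than their raw frequency would suggest.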

### Batch generation with negative samples

In [None]:
import random
import numpy as np
import torch

def get_next_batch_neg_smpl(contexts, window_size, batch_size, epochs_count):
    # batch_size must be a multiple of the full context size, so that every central word enters a batch with its whole context
    assert batch_size % (window_size * 2) == 0
    central_words, contexts = zip(*contexts)
    batch_size //= (window_size * 2)  # number of central words per batch
    # Build the sampling lists once instead of on every iteration
    noise_indices = list(index2prob.keys())
    noise_probs = list(index2prob.values())

    for epoch in range(epochs_count):
        print(f'epoch #{epoch}'.center(50, '-'))
        indices = np.arange(len(contexts))
        np.random.shuffle(indices)
        batch_begin = 0
        while batch_begin < len(contexts):  # iterate until all contexts (central words) are processed
            batch_indices = indices[batch_begin: batch_begin + batch_size]
            batch_contexts, batch_centrals, batch_targets = [], [], []
            for data_ind in batch_indices:
                central_word, context = central_words[data_ind], contexts[data_ind]
                negative_samples = np.random.choice(noise_indices, size=len(context), p=noise_probs)
                batch_contexts.extend(context)
                batch_contexts.extend(negative_samples)
                # one copy of the central word for every context word and every negative sample
                batch_centrals.extend([central_word] * len(context) * 2)
                batch_targets.extend([1]*len(context))  # words from the context
                batch_targets.extend([0]*len(context))  # random words drawn from the noise distribution
            batch_begin += batch_size
            yield {'contexts': torch.cuda.LongTensor(batch_contexts),
                   'centrals': torch.cuda.LongTensor(batch_centrals),
                   'targets': torch.cuda.FloatTensor(batch_targets)}  # float type required by BCELoss

print(next(get_next_batch_neg_smpl(contexts, window_size=2, batch_size=64, epochs_count=10)))

---------------------epoch #0---------------------
{'contexts': tensor([    8,   311,   650, 10245,  3582,  1330, 40898,   586,     0, 20005,
         1904,   234,   654, 17825, 13461,  8308,  4308,   432,  2344,    65,
        34076,   148, 63292, 62371,   661,    36,  1397,  3123,  6714, 50964,
          242, 10176,    78,  7767,   330,   289,  6295,   283,  7769,     3,
          224,  4011, 24279, 13022,  1015, 37556, 11051, 11038,   376, 34174,
            0,    90,  8048,    17, 33927, 15435,  9128,     4,     0,     5,
         5044, 10306,    22, 52204,     1,   624,     7,   993,  1998,   230,
            7,  4271,  3315,     5,     0,  2637, 19288, 30850,    76,   206,
         1692,    46,     9,    87,  3599,  6593, 18889,  5769,     0, 21061,
        22526,     1,  7325,  4419,  7420, 58694,   571,     3,    89, 35597,
          669, 25224, 53696,   393,   140,    26,    10,  1193, 51389,  3829,
        52295,   418,   670,  6125,  3014,     1,  4987,  2823,    33,  3193,


### Model

In [None]:
import torch.nn as nn
import torch.optim as optim
import time

class NegSmpl_Model(nn.Module):
    def __init__(self, vocab_size, embedding_dim=32):
        super().__init__()  
        self.embeddings_1 = nn.Embedding(vocab_size, embedding_dim)
        self.embeddings_2 = nn.Embedding(vocab_size, embedding_dim)
        self.sigmoid = torch.nn.Sigmoid()

    def forward(self, inputs):
        words_1 = self.embeddings_1(inputs['contexts'])
        words_2 = self.embeddings_2(inputs['centrals'])
        # Row-wise dot product: multiply element-wise, then sum over the embedding dimension
        # (torch.dot only works for 1-D tensors)
        mtx_mul = torch.mul(words_1, words_2)
        dot_product = torch.sum(mtx_mul, dim=1)
        output = self.sigmoid(dot_product)
        return output

In [None]:
model = NegSmpl_Model(vocabulary.size, 32)

device = torch.device("cuda")
model = model.to(device)

loss_every_nsteps = 1000
total_loss = 0
start_time = time.time()
optimizer = optim.Adam(model.parameters(), lr=0.01)
loss_function = nn.BCELoss().cuda()

for step, inputs in enumerate(get_next_batch_neg_smpl(contexts, window_size=2, batch_size=256, epochs_count=1)):
    probs = model(inputs)  # Forward pass (the model already applies the sigmoid)
    loss = loss_function(probs, inputs['targets'])  # Compute the loss
    loss.backward()  # Compute the gradients dL/dw
    optimizer.step()  # Gradient descent step or one of its variants (Adam here)
    optimizer.zero_grad()  # Reset the gradients before the next iteration

    total_loss += loss.item()
    if step != 0 and step % loss_every_nsteps == 0:
        print("Step = {}, Avg Loss = {:.4f}, Time = {:.2f}s".format(step, total_loss / loss_every_nsteps, time.time() - start_time))
        total_loss = 0
        start_time = time.time()

---------------------epoch #0---------------------


KeyboardInterrupt: ignored

### Testing

In [None]:
embeddings = model.embeddings_2.weight.cpu().data.numpy()

In [None]:
def get_text_embedding(embeddings, vocabulary, phrase):
    embeddings = np.array([embeddings[vocabulary.get_index(word.text.lower())] for word in tokenize(phrase)])
    return np.mean(embeddings, axis=0)

target_labels = set(train_dataset["topic"].dropna().tolist())
target_labels -= {"69-я параллель", "Крым", "Культпросвет ", "Оружие", "Бизнес", "Путешествия"}
target_labels = list(target_labels)
print(target_labels)

pattern = r'(\b{}\b)'.format('|'.join(target_labels))

train_with_topics = train_dataset[train_dataset["topic"].str.contains(pattern, case=False, na=False)]
train_with_topics = train_with_topics.head(20000)

test_with_topics = test_dataset[test_dataset["topic"].str.contains(pattern, case=False, na=False)]

y_train = train_with_topics["topic"].apply(lambda x: target_labels.index(x)).to_numpy()
X_train = np.zeros((train_with_topics.shape[0], embeddings.shape[1]))
for i, embedding in enumerate(train_with_topics["text"]):
    X_train[i, :] = get_text_embedding(embeddings, vocabulary, embedding)

y_test = test_with_topics["topic"].apply(lambda x: target_labels.index(x)).to_numpy()
X_test = np.zeros((test_with_topics.shape[0], embeddings.shape[1]))
for i, embedding in enumerate(test_with_topics["text"]):
    X_test[i, :] = get_text_embedding(embeddings, vocabulary, embedding)

print(X_train.shape)
print(y_train)

['Бывший СССР', 'Интернет и СМИ', 'Культура', 'Россия', 'Мир', 'Силовые структуры', 'Наука и техника', 'Экономика', 'Дом', 'Ценности', 'Из жизни', 'Спорт']


  train_with_topics = train_dataset[train_dataset["topic"].str.contains(pattern, case=False, na=False)]
  test_with_topics = test_dataset[test_dataset["topic"].str.contains(pattern, case=False, na=False)]


(20000, 32)
[4 4 2 ... 1 7 9]


In [None]:
from sklearn.neural_network import MLPClassifier
from sklearn import metrics

clf = MLPClassifier()
clf.fit(X_train, y_train)

y_predicted = clf.predict(X_test)
print(metrics.classification_report(y_test, y_predicted))



              precision    recall  f1-score   support

           0       0.31      0.18      0.23      2156
           1       0.23      0.25      0.24      2447
           2       0.30      0.26      0.28      1995
           3       0.22      0.30      0.26      4324
           4       0.29      0.42      0.34      4291
           5       0.25      0.10      0.14      1663
           6       0.28      0.33      0.31      2119
           7       0.33      0.24      0.28      3185
           8       0.26      0.21      0.23      1182
           9       0.19      0.15      0.17      1177
          10       0.41      0.31      0.35      2191
          11       0.39      0.41      0.40      3429

    accuracy                           0.29     30159
   macro avg       0.29      0.26      0.27     30159
weighted avg       0.29      0.29      0.28     30159



## Triplet loss

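For each triplet (pivot, positive, negative), the score difference pivot * positive - pivot * negative should be as large as possible: the pivot must be closer to the positive word than to the negative one.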
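
One common way to turn this score difference into a loss is a hinge with a margin. The sketch below works on plain Python lists; the function names, the toy vectors, and the margin value of 1.0 are assumptions made for illustration.

```python
def dot(u, v):
    """Dot product of two equal-length vectors given as lists."""
    return sum(a * b for a, b in zip(u, v))

def triplet_hinge_loss(pivot, positive, negative, margin=1.0):
    """max(0, margin - (pivot . positive - pivot . negative)); margin is an assumed hyperparameter."""
    score_diff = dot(pivot, positive) - dot(pivot, negative)
    return max(0.0, margin - score_diff)

# Hypothetical 2-D embeddings
pivot = [1.0, 0.0]
close = [2.0, 0.0]   # points in the same direction as the pivot
far = [-1.0, 0.5]    # points away from the pivot

good = triplet_hinge_loss(pivot, positive=close, negative=far)
bad = triplet_hinge_loss(pivot, positive=far, negative=close)
print(good, bad)
```

The loss is zero once the positive score exceeds the negative score by at least the margin, and it grows linearly when the order is wrong, so minimizing it cannot run off to minus infinity.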

### Batch generation

In [None]:
def get_next_batch_triplet(contexts, window_size, batch_size, epochs_count):
    # batch_size must be a multiple of the full context size, so that every central word enters a batch with its whole context
    assert batch_size % (window_size * 2) == 0
    central_words, contexts = zip(*contexts)
    batch_size //= (window_size * 2)  # number of central words per batch
    # Build the sampling lists once instead of on every iteration
    noise_indices = list(index2prob.keys())
    noise_probs = list(index2prob.values())

    for epoch in range(epochs_count):
        print(f'epoch #{epoch}'.center(50, '-'))
        indices = np.arange(len(contexts))
        np.random.shuffle(indices)
        batch_begin = 0
        while batch_begin < len(contexts):  # iterate until all contexts (central words) are processed
            batch_indices = indices[batch_begin: batch_begin + batch_size]
            batch_pivot, batch_positive, batch_negative = [], [], []
            for data_ind in batch_indices:
                central_word, context = central_words[data_ind], contexts[data_ind]
                # one negative sample per context word
                negative_samples = np.random.choice(noise_indices, size=len(context), p=noise_probs)
                batch_pivot.extend([central_word] * len(context))
                batch_positive.extend(context)
                batch_negative.extend(negative_samples)
            batch_begin += batch_size
            yield {'pivot': torch.cuda.LongTensor(batch_pivot),
                   'positive': torch.cuda.LongTensor(batch_positive),
                   'negative': torch.cuda.LongTensor(batch_negative)
            }

print(next(get_next_batch_triplet(contexts, window_size=2, batch_size=64, epochs_count=10)))

---------------------epoch #0---------------------
{'contexts': tensor([  136,  4454,     0, 61858,    39,  9230,  3243,  3351,    49,    59,
          172,    38,     4,    25, 24836,    57,    20,     2,     1, 16668,
        12590, 18498, 10683,     2,  4932, 46412,   740,  2036,     2, 12829,
        49533,  3333,   216,     2, 12454, 34118, 11003,    29,   292, 11742,
          214,     8,  5483,  4614,  3054,    13,   459,  6342,  6488,    85,
           33,    34,  7346,  4069,   865, 70883,     9,  6287,    19,     0,
         4982,  5290,   200,   173,  6192,    35,     1,   452, 68840,  4031,
        40442,    10,     1,  4476,  3599,  3146, 53737, 42580,  2911,   312,
         8199, 24337,    31,   685,  6219, 11064,  4038,   249,   464,     8,
            0,    27,  8672,   228,  6629, 56193,   741,     0, 15317,     5,
        25735,  6931,  7440,  3066,  2171, 19899,  9626,   117,   445, 26580,
         7622,    16,     0,    45,    33,   204,  1599,  2855,  4609,  2529,


### Model

In [None]:
class Triplet_Model(nn.Module):
    def __init__(self, vocab_size, embedding_dim=32):
        super().__init__()  
        self.embeddings_1 = nn.Embedding(vocab_size, embedding_dim)
        self.embeddings_2 = nn.Embedding(vocab_size, embedding_dim)
        self.embeddings_3 = nn.Embedding(vocab_size, embedding_dim)
        self.sigmoid = torch.nn.Sigmoid()

    def forward(self, inputs):
        pivot = self.embeddings_1(inputs['pivot'])
        positive = self.embeddings_2(inputs['positive'])
        negative = self.embeddings_3(inputs['negative'])
        return {'pivot': pivot, 'positive': positive, 'negative': negative}

In [None]:
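# Source: https://discuss.pytorch.org/t/custom-function-missing-argument/11018
class TripletLoss(nn.Module):
    def __init__(self, margin=1.0):
        super().__init__()
        self.margin = margin  # assumed value; tune for the task

    def forward(self, outputs):
        pivot = outputs['pivot']
        positive = outputs['positive']
        negative = outputs['negative']

        # Row-wise dot products pivot * positive and pivot * negative
        a_1 = torch.sum(torch.mul(pivot, positive), dim=1)
        a_2 = torch.sum(torch.mul(pivot, negative), dim=1)

        # Hinge on the score difference: minimizing it pushes a_1 above a_2 by at least the margin.
        # Minimizing a_1 - a_2 directly would do the opposite and is unbounded below.
        loss = torch.clamp(self.margin - (a_1 - a_2), min=0)
        return loss.sum()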

In [None]:
model = Triplet_Model(vocabulary.size, 32)

device = torch.device("cuda")
model = model.to(device)

loss_every_nsteps = 1000
total_loss = 0
start_time = time.time()
optimizer = optim.Adam(model.parameters(), lr=0.01)
loss_function = TripletLoss().cuda()

for step, inputs in enumerate(get_next_batch_triplet(contexts, window_size=2, batch_size=256, epochs_count=1)):
    outputs = model(inputs)  # Forward pass: embeddings of pivot, positive and negative words
    loss = loss_function(outputs)  # Compute the loss
    loss.backward()  # Compute the gradients dL/dw
    optimizer.step()  # Gradient descent step or one of its variants (Adam here)
    optimizer.zero_grad()  # Reset the gradients before the next iteration

    total_loss += loss.item()
    if step != 0 and step % loss_every_nsteps == 0:
        print("Step = {}, Avg Loss = {:.4f}, Time = {:.2f}s".format(step, total_loss / loss_every_nsteps, time.time() - start_time))
        total_loss = 0
        start_time = time.time()

---------------------epoch #0---------------------


KeyboardInterrupt: ignored

### Testing

In [None]:
embeddings = model.embeddings_1.weight.cpu().data.numpy()

In [None]:
def get_text_embedding(embeddings, vocabulary, phrase):
    embeddings = np.array([embeddings[vocabulary.get_index(word.text.lower())] for word in tokenize(phrase)])
    return np.mean(embeddings, axis=0)

target_labels = set(train_dataset["topic"].dropna().tolist())
target_labels -= {"69-я параллель", "Крым", "Культпросвет ", "Оружие", "Бизнес", "Путешествия"}
target_labels = list(target_labels)
print(target_labels)

pattern = r'(\b{}\b)'.format('|'.join(target_labels))

train_with_topics = train_dataset[train_dataset["topic"].str.contains(pattern, case=False, na=False)]
train_with_topics = train_with_topics.head(20000)

test_with_topics = test_dataset[test_dataset["topic"].str.contains(pattern, case=False, na=False)]

y_train = train_with_topics["topic"].apply(lambda x: target_labels.index(x)).to_numpy()
X_train = np.zeros((train_with_topics.shape[0], embeddings.shape[1]))
for i, embedding in enumerate(train_with_topics["text"]):
    X_train[i, :] = get_text_embedding(embeddings, vocabulary, embedding)

y_test = test_with_topics["topic"].apply(lambda x: target_labels.index(x)).to_numpy()
X_test = np.zeros((test_with_topics.shape[0], embeddings.shape[1]))
for i, embedding in enumerate(test_with_topics["text"]):
    X_test[i, :] = get_text_embedding(embeddings, vocabulary, embedding)

print(X_train.shape)
print(y_train)

['Бывший СССР', 'Интернет и СМИ', 'Культура', 'Россия', 'Мир', 'Силовые структуры', 'Наука и техника', 'Экономика', 'Дом', 'Ценности', 'Из жизни', 'Спорт']


  train_with_topics = train_dataset[train_dataset["topic"].str.contains(pattern, case=False, na=False)]
  test_with_topics = test_dataset[test_dataset["topic"].str.contains(pattern, case=False, na=False)]


(20000, 32)
[4 4 2 ... 1 7 9]


In [None]:
from sklearn.neural_network import MLPClassifier
from sklearn import metrics

clf = MLPClassifier()
clf.fit(X_train, y_train)

y_predicted = clf.predict(X_test)
print(metrics.classification_report(y_test, y_predicted))



              precision    recall  f1-score   support

           0       0.24      0.09      0.13      2156
           1       0.28      0.17      0.21      2447
           2       0.38      0.29      0.33      1995
           3       0.24      0.40      0.30      4324
           4       0.33      0.43      0.37      4291
           5       0.29      0.21      0.24      1663
           6       0.36      0.33      0.34      2119
           7       0.32      0.30      0.31      3185
           8       0.33      0.19      0.24      1182
           9       0.28      0.29      0.28      1177
          10       0.46      0.46      0.46      2191
          11       0.45      0.43      0.44      3429

    accuracy                           0.33     30159
   macro avg       0.33      0.30      0.31     30159
weighted avg       0.33      0.33      0.32     30159



# Possible unsupervised targets
Word-level models have a number of problems. The main one is that identical tokens get identical representations in different contexts. In addition, naive Skip-gram and CBoW ignore the order of tokens within the context.

How can we extract information from raw text? What should the models that produce our representations learn to do?

1.   **Skip-gram** - 2013
2.   **CBoW** - 2013
3.   FastText - 2016
4.   LM: language modeling (ELMo, ULMFiT) - 2018
5.   NSP: next sentence prediction (BERT; sometimes dropped in its modifications) - 2018
6.   MLM: masked language modeling (BERT, the main target) - 2018 - a classification task
7.   Domain-specific targets (e.g. predicting which headline matches a news text)
8.   Denoising auto-encoding (BART, mBART, T5) - seq2seq
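
To make the MLM target concrete, here is a simplified sketch of how a training example could be built. The `[MASK]` symbol, the function name, and the masking probability are assumptions for illustration; BERT's actual procedure is more involved (part of the selected tokens are replaced by random tokens or left unchanged).

```python
import random

MASK = "[MASK]"  # placeholder token; the actual symbol depends on the tokenizer

def make_mlm_example(tokens, mask_prob=0.15, seed=0):
    """Return (inputs, labels): masked positions keep the original token as the label, others get None."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)
            labels.append(token)   # the model must classify this position into the vocabulary
        else:
            inputs.append(token)
            labels.append(None)    # this position is ignored by the loss
    return inputs, labels

tokens = "россия помогает усилению турции на международной арене".split()
inputs, labels = make_mlm_example(tokens, mask_prob=0.3, seed=1)
print(inputs)
print(labels)
```

Each masked position is a classification problem over the vocabulary, which is why the list above calls MLM a classification task.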



# Language models



Language modeling is a rather old and well-understood task. A statistical language model is a probability distribution over sequences of words: $$P(w_1,...,w_n)$$

An alternative formulation predicts the next word given the preceding ones:
$$P(w_n | w_1,...,w_{n-1}) = P(w_n|w_1^{n-1})$$

The two are linked by the chain rule: $P(w_1,...,w_n) = \prod_{i=1}^{n} P(w_i|w_1^{i-1})$.

N-gram models approximate the full history by the last $N-1$ words:

$$P(w_n|w_1^{n-1}) \approx P(w_n|w_{n-N+1}^{n-1})$$
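
For $N=2$ the approximation reduces to bigram probabilities, which can be estimated by counting. The sketch below uses a hypothetical toy corpus and a hypothetical helper `bigram_prob`; it shows only the maximum-likelihood estimate, without smoothing or back-off.

```python
from collections import Counter

# Hypothetical toy corpus: each sentence is a list of tokens
corpus = [
    ["a", "b", "c"],
    ["a", "b", "d"],
    ["a", "c", "d"],
]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    unigram_counts.update(sentence)
    bigram_counts.update(zip(sentence, sentence[1:]))

def bigram_prob(prev, word):
    """Maximum-likelihood estimate P(word | prev) = count(prev, word) / count(prev)."""
    if unigram_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(bigram_prob("a", "b"))
print(bigram_prob("b", "c"))
```

Here "a" occurs three times and is followed by "b" twice, so the estimate of P(b | a) is 2/3. Unseen pairs get probability zero, which is the reason practical n-gram models add back-off or smoothing.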

## Example of an N-gram model

In [None]:
class NGramModel:
    def __init__(self, vocabulary, n=4):
        self.n = n
        # n_grams[k] maps a k-gram (tuple of word indices) to its count, later to its probability
        self.n_grams = [Counter() for _ in range(n+1)]
        self.vocabulary = vocabulary
    
    def collect_n_grams(self, tokens):
        indices = [self.vocabulary.get_index(token) for token in tokens]
        count = len(indices)
        # Count all k-grams for k = 0..n; the empty 0-gram accumulates the total token count
        for n in range(self.n + 1):
            for i in range(min(count - n + 1, count)):
                n_gram = indices[i:i+n]
                self.n_grams[n][tuple(n_gram)] += 1
                
    def normalize(self):
        # Go from the highest order down, so that lower-order entries are still raw counts when used
        for n in range(self.n, 0, -1):
            current_n_grams = self.n_grams[n]
            for words, count in current_n_grams.items():
                prev_order_n_gram_count = self.n_grams[n-1][words[:-1]]
                current_n_grams[words] = count / prev_order_n_gram_count
        self.n_grams[0][tuple()] = 1.0
    
    def predict(self, context):
        indices = [self.vocabulary.get_index(token) for token in context]
        context = tuple(indices[-self.n + 1:])
        step_probabilities = np.zeros((self.vocabulary.size, ), dtype=np.float64)
        # Back off: try the longest available context first, then shorter ones
        for shift in range(self.n):
            current_n = self.n - shift
            wanted_context_length = current_n - 1
            if wanted_context_length > len(context):
                continue
            start_index = len(context) - wanted_context_length
            wanted_context = context[start_index:]
            
            s = 0.0
            for index in range(self.vocabulary.size):
                n_gram = wanted_context + (index,)
                p = self.n_grams[current_n].get(n_gram, 0)
                step_probabilities[index] = p
                s += p
            # Stop at the first order in which this context has been seen
            if s != 0.0:
                break
        return step_probabilities

vocabulary.word2index["<eos>"] = vocabulary.size
vocabulary.index2word.append("<eos>")
n_gram_model = NGramModel(vocabulary)
for text in texts[:1000]:
    n_gram_model.collect_n_grams(text + ["<eos>"])
n_gram_model.normalize()

In [None]:
seed = ["россия"]
while seed[-1] != "<eos>":
    proba = n_gram_model.predict(seed)
    seed.append(np.random.choice(vocabulary.index2word, size=1, p=proba)[0])
    print(seed)

['россия', 'помогает']
['россия', 'помогает', 'усилению']
['россия', 'помогает', 'усилению', 'турции']
['россия', 'помогает', 'усилению', 'турции', 'на']
['россия', 'помогает', 'усилению', 'турции', 'на', 'международной']
['россия', 'помогает', 'усилению', 'турции', 'на', 'международной', 'арене']
['россия', 'помогает', 'усилению', 'турции', 'на', 'международной', 'арене', '<eos>']


## ELMo (Embeddings from Language Models)

Original paper: https://arxiv.org/pdf/1802.05365.pdf

The Illustrated BERT, ELMo and co.: http://jalammar.github.io/illustrated-bert/

How do we apply it?

In [None]:
!wget http://vectors.nlpl.eu/repository/11/195.zip
!mkdir elmo && mv 195.zip elmo/195.zip && cd elmo && unzip 195.zip && rm 195.zip && cd ..
!ls elmo

--2023-03-27 15:52:35--  http://vectors.nlpl.eu/repository/11/195.zip
Resolving vectors.nlpl.eu (vectors.nlpl.eu)... 129.240.189.181
Connecting to vectors.nlpl.eu (vectors.nlpl.eu)|129.240.189.181|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 206977021 (197M) [application/zip]
Saving to: ‘195.zip’


2023-03-27 15:54:16 (1.99 MB/s) - ‘195.zip’ saved [206977021/206977021]

Archive:  195.zip
  inflating: meta.json               
  inflating: model.hdf5              
  inflating: options.json            
  inflating: README                  
  inflating: vocab.txt               
meta.json  model.hdf5  options.json  README  vocab.txt


In [None]:
!pip install --upgrade allennlp==0.9.0

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting allennlp==0.9.0
  Using cached allennlp-0.9.0-py3-none-any.whl (7.6 MB)
Collecting responses>=0.7
  Downloading responses-0.23.1-py3-none-any.whl (52 kB)
Collecting ftfy
  Downloading ftfy-6.1.1-py3-none-any.whl (53 kB)
Collecting numpydoc>=0.8.0
  Downloading numpydoc-1.5.0-py3-none-any.whl (52 kB)
Collecting flask-cors>=3.0.7
  Using cached Flask_Cors-3.0.10-py2.py3-none-any.whl (14 kB)
Collecting flaky
  Using cached flaky-3.7.0-py2.py3-none-any.whl (22 kB)
Collecting word2number>=1.1
  Using cached word2number-1.1.

In [None]:
from allennlp.commands.elmo import ElmoEmbedder
elmo = ElmoEmbedder(options_file="elmo/options.json", weight_file="elmo/model.hdf5", cuda_device=0)

ModuleNotFoundError: ignored

In [None]:
embeddings = elmo.batch_to_embeddings(texts[:32])[0].cpu().numpy()
print(embeddings.shape)
embeddings = embeddings.swapaxes(1, 2)
print(embeddings.shape)
embeddings = embeddings.reshape(embeddings.shape[0], embeddings.shape[1], -1)
print(embeddings.shape)
embeddings = np.mean(embeddings, axis=1)
print(embeddings.shape)
embeddings

(32, 3, 38, 1024)
(32, 38, 3, 1024)
(32, 38, 3072)
(32, 3072)


array([[ 0.01933659,  0.03807351,  0.04648911, ...,  0.27105376,
        -0.28999355, -0.02818399],
       [ 0.05876707, -0.09322704,  0.01292978, ..., -0.03196952,
        -0.11905713,  0.14430685],
       [ 0.03688765,  0.17712657,  0.1395728 , ...,  0.26800117,
        -0.00575656,  0.00803849],
       ...,
       [ 0.01002959, -0.01883668,  0.01405345, ...,  0.00405536,
         0.00785995,  0.02182663],
       [-0.0501788 , -0.06883522,  0.06019326, ...,  0.93084395,
        -0.15911382,  0.1931586 ],
       [ 0.09397829,  0.01189152,  0.11018443, ...,  0.12281392,
         0.01181271,  0.203925  ]], dtype=float32)

# Overview of models

* 1) Word-level embeddings:
  - Word2Vec: CBoW <- deprecated
  - Word2Vec: Skip-gram <- deprecated
  - GloVe <- deprecated
  - FastText <- option 1: when performance matters or when word-level embeddings specifically are needed
* 2) LM-based embeddings:
  - ULMFiT <- deprecated
  - ELMo <- deprecated
* 3) MLM-based embeddings:
  - BERT <- deprecated
  - XLM-RoBERTa <- option 2
* 4) NSP-based embeddings
  - DSSM-like <- when performance matters; 15 times faster than ELMo
  - LSTM-like
* 5) Denoising-based encoders (text2text tasks, e.g. machine translation, text summarization)
  - mBART 
  - T5
  - BERT with a decoder (BertSumAbs)
* 6) MT-based embeddings
  - LASER <- option 2
* 7) Multitask
  - USE
* 8) TF-IDF embeddings - option 0
* 9) Custom:
  - News: matching a headline to a text (FastText can be plugged in)
  - Search: matching a document to a query

Recipe:
* 1) TF-IDF
* 2) FastText
* 3) XLM-RoBERTa or LASER
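
The first step of the recipe needs no training at all. As a minimal sketch, TF-IDF weights can be computed with the standard library alone; the function name and the toy documents are hypothetical, and the formula is the plain tf * log(N / df) variant, whereas library implementations such as scikit-learn's `TfidfVectorizer` use a smoothed variant with normalization.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per tokenized document, weight = tf * log(N / df)."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each term once per document
    vectors = []
    for doc in docs:
        term_counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in term_counts.items()
        })
    return vectors

# Hypothetical tokenized documents
docs = [
    ["матч", "закончился", "победой"],
    ["курс", "рубля", "вырос"],
    ["матч", "перенесли"],
]
vectors = tfidf_vectors(docs)
print(vectors[0])
```

A term that appears in several documents gets a lower weight than a term unique to one document, which is what makes these sparse vectors a reasonable baseline for topic classification.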


## Task 3: Topic classification: ELMo, XLM-RoBERTa, LASER, or USE

Check how one of these models performs on the topic classification task.

In [None]:
# XLM-RoBERTa for Russian  - https://huggingface.co/DeepPavlov/xlm-roberta-large-en-ru
# model usage example      - https://huggingface.co/xlm-roberta-base

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("DeepPavlov/xlm-roberta-large-en-ru")
model = AutoModel.from_pretrained("DeepPavlov/xlm-roberta-large-en-ru")

Downloading:   0%|          | 0.00/582 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/922k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.44M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/238 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/722 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.27G [00:00<?, ?B/s]

In [None]:
device = torch.device("cpu")
model = model.to(device)

In [None]:
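def get_text_embedding(tokenizer, model, phrase):
    # Encode the given phrase; truncate it to the model's maximum input length
    encoded_input = tokenizer(phrase, return_tensors='pt', truncation=True, max_length=512).to(device)
    with torch.no_grad():
        output = model(**encoded_input)
    # Mean-pool the last-layer token vectors into a single text vector
    embedding = torch.mean(output.last_hidden_state, dim=1)
    return embedding.squeeze(0).cpu().numpy()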

In [None]:
target_labels = set(train_dataset["topic"].dropna().tolist())
target_labels -= {"69-я параллель", "Крым", "Культпросвет ", "Оружие", "Бизнес", "Путешествия"}
target_labels = list(target_labels)
print(target_labels)

pattern = r'(\b{}\b)'.format('|'.join(target_labels))

train_with_topics = train_dataset[train_dataset["topic"].str.contains(pattern, case=False, na=False)]
train_with_topics = train_with_topics.head(20000)

test_with_topics = test_dataset[test_dataset["topic"].str.contains(pattern, case=False, na=False)]

y_train = train_with_topics["topic"].apply(lambda x: target_labels.index(x)).to_numpy()
X_train = np.zeros((train_with_topics.shape[0], 1024))
for i, embedding in enumerate(tqdm(train_with_topics["text"])):
    X_train[i, :] = get_text_embedding(tokenizer, model, embedding)

y_test = test_with_topics["topic"].apply(lambda x: target_labels.index(x)).to_numpy()
X_test = np.zeros((test_with_topics.shape[0], 1024))
for i, embedding in enumerate(tqdm(test_with_topics["text"])):
    X_test[i, :] = get_text_embedding(tokenizer, model, embedding)

print(X_train.shape)
print(y_train)

['Бывший СССР', 'Интернет и СМИ', 'Культура', 'Россия', 'Мир', 'Силовые структуры', 'Наука и техника', 'Экономика', 'Дом', 'Ценности', 'Из жизни', 'Спорт']


  train_with_topics = train_dataset[train_dataset["topic"].str.contains(pattern, case=False, na=False)]
  test_with_topics = test_dataset[test_dataset["topic"].str.contains(pattern, case=False, na=False)]
100%|██████████| 20000/20000 [2:01:50<00:00,  2.74it/s]
100%|██████████| 30159/30159 [3:02:15<00:00,  2.76it/s]

(20000, 1024)
[4 4 2 ... 1 7 9]





In [None]:
from sklearn.neural_network import MLPClassifier
from sklearn import metrics

clf = MLPClassifier()
clf.fit(X_train, y_train)

y_predicted = clf.predict(X_test)
print(metrics.classification_report(y_test, y_predicted))

              precision    recall  f1-score   support

           0       0.00      0.00      0.00      2156
           1       0.00      0.00      0.00      2447
           2       0.00      0.00      0.00      1995
           3       0.14      1.00      0.25      4324
           4       0.00      0.00      0.00      4291
           5       0.00      0.00      0.00      1663
           6       0.00      0.00      0.00      2119
           7       0.00      0.00      0.00      3185
           8       0.00      0.00      0.00      1182
           9       0.00      0.00      0.00      1177
          10       0.00      0.00      0.00      2191
          11       0.00      0.00      0.00      3429

    accuracy                           0.14     30159
   macro avg       0.01      0.08      0.02     30159
weighted avg       0.02      0.14      0.04     30159



  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
