# Assignment 9

Use data from `https://github.com/thedenaas/hse_seminars/tree/master/2018/seminar_13/data.zip`  
Implement model in pytorch from "An Unsupervised Neural Attention Model for Aspect Extraction, He et al, 2017", also desribed in seminar notes.  

You can use sentence embeddings with attention **[7 points]**:  
$z_s = \sum_{i}^n \alpha_i e_{w_i}, z_s \in R^d$ sentence embedding  
$\alpha_i = softmax(d_i)$  attention weight for i-th token  
$d_i = e_{w_i}^T M y_s$ attention with trainable matrix $M \in R^{dxd}$  
$y_s = \frac 1 n \sum_{i=1}^n e_{w_i}, y_s \in R^d$ sentence context  
$e_{w_i} \in R^d$, token embedding of size d  
$n$ - number of tokens in a sentence  

**Or** just use sentence embedding as an average over word embeddings **[5 points]**:  
$z_s = \frac 1 n \sum_{i=1}^n e_{w_i}, z_s \in R^d$ sentence embedding  
$e_{w_i} \in R^d$, token embedding of size d  
$n$ - number of tokens in a sentence  
 
$p_t = softmax(W z_s + b), p_t \in R^K$ topic weights for sentence $s$, with trainable matrix $W \in R^{dxK}$ and bias vector $b \in R^K$  
$r_s = T^T p_t, r_s \in R^d$ reconstructed sentence embedding as a weighted sum of topic embeddings   
$T \in R^{Kxd}$ trainable matrix of topic embeddings, K=number of topics


**Training objective**:
$$ J = \sum_{s \in D} \sum_{i=1}^n max(0, 1-r_s^T z_s + r_s^T n_i) + \lambda ||T^T T - I ||^2_F  $$
where   
$m$ random sentences are sampled as negative examples from dataset $D$ for each sentence $s$  
$n_i = \frac 1 n \sum_{i=j}^n e_{w_j}$ average of word embeddings in the i-th sentence  
$||T^T T - I ||_F$ regularizer, that enforces matrix $T$ to be orthogonal  
$||A||^2_F = \sum_{i=1}^N\sum_{j=1}^M a_{ij}^2, A \in R^{NxM}$ Frobenius norm


**[3 points]** Compute topic coherence for at least for 3 different number of topics. Use 10 nearest words for each topic. It means you have to train one model for each number of topics. You can use code from seminar notes with word2vec similarity scores.

In [0]:
import numpy as np
import pandas as pd
import nltk

import torch as tt
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset, TensorDataset
from torchtext.data import Field, TabularDataset, BucketIterator, Iterator

from sklearn.model_selection import train_test_split
from tqdm import tqdm_notebook

DEVICE = tt.device('cuda') if tt.cuda.is_available() else tt.device('cpu')

In [2]:
from google.colab import drive
drive.mount('/content/gdrive')

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /content/gdrive


In [0]:
import os
os.chdir('gdrive/My Drive/Colab Notebooks')

In [0]:
f = open('data.txt').read()

In [0]:
stop_words = open('stopwords.txt').read().split('\n')

In [9]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [0]:
from nltk.tokenize import sent_tokenize, word_tokenize

texts = nltk.sent_tokenize(f)[:5000] #весь датасет очень долго обрабатывался, пришлось взять только часть

In [0]:
df = pd.DataFrame(data=texts, columns=['texts'])

In [47]:
df.head()

Unnamed: 0,texts
0,Barclays' defiance of US fines has merit Barcl...
1,"So it is tempting to think the bank, when aske..."
2,"That is not the view of the chief executive, J..."
3,Barclays thinks the DoJ’s claims are “disconne...
4,"But actually, some grudging respect for Staley..."


Создадим негативные сэмплы

In [0]:
def neg_cand(ind):
  neg_cands = [x for x in range(len(df.texts)) if x != ind]
  neg_idx = np.random.choice(neg_cands)
  return df.iloc[neg_idx, 0]

In [0]:
for i in range(3): #let's do 3 negative samples
  df[f'neg_{i+1}'] = [neg_cand(ind) for ind,text in enumerate(df.texts)]

In [50]:
'''df['neg_1'] = neg_samples[0]
df['neg_2'] = neg_samples[1]
df['neg_3'] = neg_samples[2]'''
df.head()

Unnamed: 0,texts,neg_1,neg_2,neg_3
0,Barclays' defiance of US fines has merit Barcl...,The big difference was that these surveys pick...,The most striking conclusion reached by Monito...,That is why it is wrong to regard the crisis a...
1,"So it is tempting to think the bank, when aske...","Derek Gambell Bromley, Kent • Your coverage of...",Join our community of development professional...,“At first I was seduced by his showmanship and...
2,"That is not the view of the chief executive, J...","Rachel Reeves, a Labour MP who sits on the com...","“We had actual published interviews with him, ...",With a world ranking of 23rd for download spee...
3,Barclays thinks the DoJ’s claims are “disconne...,"“If they weren’t so deeply troubling, these re...","Microsoft, for example, is doing a SID takeove...",The underlying problems of the economy are the...
4,"But actually, some grudging respect for Staley...",Though Labour in office had tried and failed t...,"Matthew Elliott, chief executive of Vote Leave...","And please email, text or phone all your frien..."


In [0]:
df.to_csv('texts.csv', index=False)

Два варианта токенизации

In [0]:
#def clean_data(l):
#  return [i for i in l if (i not in stop_words)]
def tokenize(text):
    return [tok for tok in nltk.word_tokenize(text) if tok not in stop_words]
    
texts_tok = []
for row in df.texts:
  texts_tok.append(tokenize(row))

In [0]:
#второй вариант токенизации (из оригинального исследования + custom stopwords)
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
import codecs

def parseSentence(line):
    lmtzr = WordNetLemmatizer()    
    stop = stopwords.words('english')
    stop.extend(stop_words)
    text_token = CountVectorizer().build_tokenizer()(line.lower())
    text_rmstop = [i for i in text_token if i not in stop]
    text_stem = [lmtzr.lemmatize(w) for w in text_rmstop]
    return text_stem

texts_tok_2 = []
for row in df.texts:
  texts_tok_2.append(parseSentence(row))

Используем word2vec с теми же параметрами, что и в оригинальном исследовании

In [57]:
from gensim.models import Word2Vec, KeyedVectors
from torchtext.vocab import Vectors

model = Word2Vec(texts_tok, size=200, window=5, min_count=10, workers=4)
model_weights = tt.FloatTensor(model.wv.vectors)
model.wv.save_word2vec_format('pretrained_embeddings')
vectors = Vectors(name='pretrained_embeddings', cache='./') 

  'See the migration notes for details: %s' % _MIGRATION_NOTES_URL
  0%|          | 0/1344 [00:00<?, ?it/s]Skipping token b'1344' with 1-dimensional vector [b'200']; likely a header
 85%|████████▍ | 1141/1344 [00:00<00:00, 9590.35it/s] 


Данные для торча

In [0]:
TEXT = Field(include_lengths=False, 
             batch_first=True, 
             tokenize = parseSentence)

In [0]:
dataset = TabularDataset(path="texts.csv",
                     format='csv',
                     skip_header=True,
                     fields=[('text', TEXT),('neg_1', TEXT), ('neg_2', TEXT), ('neg_3', TEXT)])

In [0]:
TEXT.build_vocab(dataset,
                 vectors = vectors, 
                 unk_init = tt.Tensor.normal_)

vocab_size = len(TEXT.vocab.itos)

In [0]:
#from sklearn.model_selection import train_test_split
SEED = 42
train, test = dataset.split(0.8, random_state=np.random.seed(SEED))
train, valid = train.split(0.9, random_state=np.random.seed(SEED))

In [0]:
batch_size = 256
train_iterator, valid_iterator, test_iterator = BucketIterator.splits(
    (train, valid, test),
    batch_sizes=(batch_size, batch_size, batch_size),
    shuffle=True,
    sort_key=lambda x: len(x.text),
    device=DEVICE
)

Построим модель

In [0]:
emb_size = 200
topic_num = 5

In [0]:
class MyModel(nn.Module):
    
    def __init__(self, vocab_size = vocab_size, emb_dim=emb_size, topic_dim=topic_num):
      super(MyModel, self).__init__()
      self.embeddings = nn.Embedding(vocab_size, emb_dim)
      self.embeddings.weight.data.copy_(TEXT.vocab.vectors)

      self.fc1 = nn.Linear(emb_dim, topic_dim)
      self.soft = F.softmax
      self.fc2 = nn.Linear(topic_dim, emb_dim, bias=False)

    def forward(self, batch):
        
        text, neg_1, neg_2, neg_3 = batch.text, batch.neg_1, batch.neg_2, batch.neg_3

        text_true = tt.sum((self.embeddings(text)), dim=1)/(self.embeddings(text)).size()[1]

        text_out = self.fc1(text_true)
        text_out = self.soft(text_out, dim=1)
        text_out = self.fc2(text_out)

        neg_1 = tt.sum((self.embeddings(neg_1)), dim=1)/(self.embeddings(neg_1)).size()[1]
        neg_2 = tt.sum((self.embeddings(neg_2)), dim=1)/(self.embeddings(neg_2)).size()[1]
        neg_3 = tt.sum((self.embeddings(neg_3)), dim=1)/(self.embeddings(neg_3)).size()[1]
        negs = [neg_1, neg_2, neg_3]
        negs = tt.stack(negs, dim=-1)

        return text_true, text_out, negs

In [0]:
model = MyModel()

In [0]:
model = model.to(DEVICE)

In [0]:
class LossFunc(nn.Module):
  
    def __init__(self, lmbd = 1):
        super().__init__()
        self.lmbd = lmbd

    def forward(self, emb_true, emb_pred, negs, param):
        losses = []
        emb_true = emb_true.unsqueeze(1).permute(0, 2, 1)
        emb_pred = emb_pred.unsqueeze(1).permute(0, 2, 1)
        for n in negs.permute(2, 0, 1):
            b = tt.bmm(emb_true, n.unsqueeze(1))
            tmp = (1 - emb_pred + b).squeeze(1)
            loss, _ = tt.max(tt.stack([tmp, tt.zeros_like(tmp)]), 0)
            losses.append(loss)

        inn = tt.mm(param.permute(1, 0), param) 
        reg = inn - tt.eye(inn.shape[0])
        reg = tt.norm(reg, p='fro') * self.lmbd

        res = tt.sum(tt.stack(losses, dim=-1)) + reg

        return res

In [0]:
criterion = LossFunc()
criterion.to(DEVICE)

optimizer = tt.optim.Adam(model.parameters())

In [0]:
def _train_epoch(model, iterator, optimizer, curr_epoch):

    model.train()

    running_loss = 0

    n_batches = len(iterator)
    iterator = tqdm_notebook(iterator, total=n_batches, desc='epoch %d' % (curr_epoch), leave=True)

    for i, batch in enumerate(iterator):
        optimizer.zero_grad()

        true, pred, negs = model(batch)
        param = model.fc2.weight
        loss = criterion(true, pred, negs, param)
        loss.backward()
        optimizer.step()

        curr_loss = loss.data.cpu().detach().item()
        
        loss_smoothing = i / (i+1)
        running_loss = loss_smoothing * running_loss + (1 - loss_smoothing) * curr_loss

        iterator.set_postfix(loss='%.5f' % running_loss)

    return running_loss

In [0]:
def _test_epoch(model, iterator):
    model.eval()
    epoch_loss = 0

    n_batches = len(iterator)
    with tt.no_grad():
        for batch in iterator:
            true, pred, negs = model(batch)
            param = model.fc2.weight
            loss = criterion(true, pred, negs, param)
            epoch_loss += loss.data.item()

    return epoch_loss / n_batches

In [0]:
def nn_train(model, train_iterator, valid_iterator, optimizer, n_epochs=2,
          scheduler=None, early_stopping=0):

    prev_loss = 100500
    es_epochs = 0
    best_epoch = None
    history = pd.DataFrame()

    for epoch in range(n_epochs):
        train_loss = _train_epoch(model, train_iterator, optimizer, epoch)
        valid_loss = _test_epoch(model, valid_iterator)

        valid_loss = valid_loss
        print('validation loss %.5f' % valid_loss)

        record = {'epoch': epoch, 'train_loss': train_loss, 'valid_loss': valid_loss}
        history = history.append(record, ignore_index=True)

        if early_stopping > 0:
            if valid_loss > prev_loss:
                es_epochs += 1
            else:
                es_epochs = 0

            if es_epochs >= early_stopping:
                best_epoch = history[history.valid_loss == history.valid_loss.min()].iloc[0]
                print('Early stopping! best epoch: %d val %.5f' % (best_epoch['epoch'], best_epoch['valid_loss']))
                break

            prev_loss = min(prev_loss, valid_loss)

In [205]:
nn_train(model, train_iterator, valid_iterator, optimizer, n_epochs=10)

Please use `tqdm.notebook.tqdm` instead of `tqdm.tqdm_notebook`
  


HBox(children=(IntProgress(value=0, description='epoch 0', max=15, style=ProgressStyle(description_width='init…


validation loss 23681995.00000


HBox(children=(IntProgress(value=0, description='epoch 1', max=15, style=ProgressStyle(description_width='init…


validation loss 23215733.00000


HBox(children=(IntProgress(value=0, description='epoch 2', max=15, style=ProgressStyle(description_width='init…


validation loss 22719191.00000


HBox(children=(IntProgress(value=0, description='epoch 3', max=15, style=ProgressStyle(description_width='init…


validation loss 22132205.50000


HBox(children=(IntProgress(value=0, description='epoch 4', max=15, style=ProgressStyle(description_width='init…


validation loss 21545952.00000


HBox(children=(IntProgress(value=0, description='epoch 5', max=15, style=ProgressStyle(description_width='init…


validation loss 21004463.50000


HBox(children=(IntProgress(value=0, description='epoch 6', max=15, style=ProgressStyle(description_width='init…


validation loss 20512399.00000


HBox(children=(IntProgress(value=0, description='epoch 7', max=15, style=ProgressStyle(description_width='init…


validation loss 20045159.50000


HBox(children=(IntProgress(value=0, description='epoch 8', max=15, style=ProgressStyle(description_width='init…


validation loss 19564526.50000


HBox(children=(IntProgress(value=0, description='epoch 9', max=15, style=ProgressStyle(description_width='init…


validation loss 19017754.00000


In [196]:
#nn_train(model, train_iterator, valid_iterator, optimizer, n_epochs=2)

Please use `tqdm.notebook.tqdm` instead of `tqdm.tqdm_notebook`
  


HBox(children=(IntProgress(value=0, description='epoch 0', max=15, style=ProgressStyle(description_width='init…


validation loss 23527075.00000


HBox(children=(IntProgress(value=0, description='epoch 1', max=15, style=ProgressStyle(description_width='init…


validation loss 23091121.50000
