<a href="https://colab.research.google.com/github/flying-bear/2018-course-poster/blob/master/assignment_9.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Assignment 9

Use data from `https://github.com/thedenaas/hse_seminars/tree/master/2018/seminar_13/data.zip`  
Implement model in pytorch from ["An Unsupervised Neural Attention Model for Aspect Extraction, He et al, 2017"](https://www.comp.nus.edu.sg/~leews/publications/acl17.pdf), also desribed in seminar notes.  


You can use sentence embeddings with attention **[7 points]**:  
$z_s = \sum_{i}^n \alpha_i e_{w_i}, z_s \in R^d$ sentence embedding  
$\alpha_i = softmax(d_i)$  attention weight for i-th token  
$d_i = e_{w_i}^T M y_s$ attention with trainable matrix $M \in R^{dxd}$  
$y_s = \frac 1 n \sum_{i=1}^n e_{w_i}, y_s \in R^d$ sentence context  
$e_{w_i} \in R^d$, token embedding of size d  
$n$ - number of tokens in a sentence  

**Or** just use sentence embedding as an average over word embeddings **[5 points]**:  
$z_s = \frac 1 n \sum_{i=1}^n e_{w_i}, z_s \in R^d$ sentence embedding  
$e_{w_i} \in R^d$, token embedding of size d  
$n$ - number of tokens in a sentence  
 
$p_t = softmax(W z_s + b), p_t \in R^K$ topic weights for sentence $s$, with trainable matrix $W \in R^{dxK}$ and bias vector $b \in R^K$  
$r_s = T^T p_t, r_s \in R^d$ reconstructed sentence embedding as a weighted sum of topic embeddings   
$T \in R^{Kxd}$ trainable matrix of topic embeddings, K=number of topics


**Training objective**:
$$ J = \sum_{s \in D} \sum_{i=1}^m max(0, 1-r_s^T z_s + r_s^T n_i) + \lambda ||T^T T - I ||^2_F  $$
where   
$m$ random sentences are sampled as negative examples from dataset $D$ for each sentence $s$  
$n_i = \frac 1 n \sum_{i=j}^n e_{w_j}$ average of word embeddings in the i-th sentence  
$||T^T T - I ||_F$ regularizer, that enforces matrix $T$ to be orthogonal  
$||A||^2_F = \sum_{i=1}^N\sum_{j=1}^M a_{ij}^2, A \in R^{NxM}$ Frobenius norm


**[3 points]** Compute topic coherence for at least for 3 different number of topics. Use 10 nearest words for each topic. It means you have to train one model for each number of topics. You can use code from seminar notes with word2vec similarity scores.

## Get data

In [3]:
!wget -O data.zip https://github.com/thedenaas/hse_seminars/blob/master/2018/seminar_13/data.zip?raw=true 

--2020-03-20 13:40:54--  https://github.com/thedenaas/hse_seminars/blob/master/2018/seminar_13/data.zip?raw=true
Resolving github.com (github.com)... 140.82.113.4
Connecting to github.com (github.com)|140.82.113.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://github.com/thedenaas/hse_seminars/raw/master/2018/seminar_13/data.zip [following]
--2020-03-20 13:40:54--  https://github.com/thedenaas/hse_seminars/raw/master/2018/seminar_13/data.zip
Reusing existing connection to github.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/thedenaas/hse_seminars/master/2018/seminar_13/data.zip [following]
--2020-03-20 13:40:55--  https://raw.githubusercontent.com/thedenaas/hse_seminars/master/2018/seminar_13/data.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.1

In [4]:
!unzip data.zip

Archive:  data.zip
  inflating: data.txt                
  inflating: stopwords.txt           


## video memeory?


In [5]:
!nvidia-smi

Fri Mar 20 13:41:02 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64.00    Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   51C    P8    31W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|  No ru

## Imports

In [0]:
import gensim
import gensim.downloader as api
import matplotlib.pyplot as plt
import numpy as np
import nltk
import pandas as pd

import torch
import torch.nn as nn
import torch.nn.functional as F

from torch.utils.data import DataLoader, TensorDataset
from torchtext.data import Field, TabularDataset, Iterator

from scipy.ndimage.filters import gaussian_filter1d
from tqdm import tqdm, tqdm_notebook

In [7]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [0]:
batch_size = 256
random_state = 42
num_neg_samples = 5

In [0]:
DEVICE = torch.device('cuda') if torch.cuda.is_available() else tt.device('cpu')

## Data

In [10]:
with open('data.txt', 'r') as f:
  text = f.read()
print(text[:100])

Barclays' defiance of US fines has merit Barclays disgraced itself in many ways during the pre-finan


In [11]:
len(text.split('/n'))

13

In [0]:
custom_stop_words = []
with open( "stopwords.txt", "r" ) as f:
    for line in f.readlines():
        custom_stop_words.append( line.strip().lower())

### Sentences

In [0]:
def sent_tokenize(text):
  return nltk.tokenize.sent_tokenize(text)

def tokenize(sent):
  return nltk.tokenize.word_tokenize(sent)

In [14]:
sent_tokenize('ads kjd.\n hadkjahd! jhwahjw! a the I')

['ads kjd.', 'hadkjahd!', 'jhwahjw!', 'a the I']

In [15]:
tokenize('jkahdak akwjhadkjhd jhjh hh')

['jkahdak', 'akwjhadkjhd', 'jhjh', 'hh']

### add negative samples


In [0]:
df = pd.DataFrame()
df['pos'] = sent_tokenize(text)

In [17]:
df.tail()

Unnamed: 0,pos
183395,It feels as though Stone realised that some of...
183396,"There are some fun elements, many involving Rh..."
183397,I particularly enjoyed a scene in which O’Bria...
183398,His carnivorous snarl fills the immense screen...
183399,There’s a playful visual flair to this moment ...


In [0]:
def add_negative(df):
  neg_id = np.random.choice(len(df))
  return df.iloc[neg_id, 0]

In [0]:
for i in range(num_neg_samples):
  df[f'neg{i}'] = df['pos'].apply(lambda x: add_negative(df))

In [20]:
df.tail()

Unnamed: 0,pos,neg0,neg1,neg2,neg3,neg4
183395,It feels as though Stone realised that some of...,The bank said in a long list of legal disclosu...,Get the latest news with Barry Glendenning.,I was a journalist with an addiction to social...,"The video, which ends with Drėgvaitė saying “f...",How much difference has she made?
183396,"There are some fun elements, many involving Rh...","But it feels, too, like the end of a broader m...",“They told me I needed to wait my turn.,"“Now, Carlos Slim comes from Mexico.","In Steven Spielberg’s cold war thriller, he pl...",Dutch people feel big decisions are being made...
183397,I particularly enjoyed a scene in which O’Bria...,It’s way more optimistic than most other estim...,"Over the years, they would create their own st...",But it’s important to try to get a handle on w...,We used to be called ironic all the time.,“This is not the time to fear the European Uni...
183398,His carnivorous snarl fills the immense screen...,Godfrey said that around 400-500 new homes a y...,"In the year of her temporary retirement, 1969,...",Become a member An impassioned Joe Biden calle...,The female sex hormones oestrogen and progeste...,The firm said: “This suggests people are nervo...
183399,There’s a playful visual flair to this moment ...,The lightshow (!),25 min Salomón Rondón scored three headers fro...,“It’s a rigged election.,"Speaking to The Wrap, Taylor Sheridan said: “T...","“We have not, to date, specifically advised th..."


In [0]:
df.to_csv('text.csv', index=False)

### Batch

In [0]:
TEXT = Field(include_lengths=False, 
             batch_first=True, 
             tokenize=tokenize,
             lower=True,
             stop_words=custom_stop_words)

datafields = [('pos',TEXT), *[(f'neg{i}', TEXT) for i in range(num_neg_samples)]]

In [0]:
trn = TabularDataset(path="text.csv",
                     format='csv',
                     skip_header=True, # if your csv header has a header, make sure to pass this to ensure it doesn't get proceesed as data!
                     fields=datafields)

In [0]:
TEXT.build_vocab(trn)

In [0]:
vocab_size = len(TEXT.vocab.itos) 

### Iterator

In [0]:
trn_itr = Iterator(trn, batch_size, device=DEVICE, shuffle=True)

In [27]:
example_batch = next(iter(trn_itr))
example_batch


[torchtext.data.batch.Batch of size 256]
	[.pos]:[torch.cuda.LongTensor of size 256x499 (GPU 0)]
	[.neg0]:[torch.cuda.LongTensor of size 256x58 (GPU 0)]
	[.neg1]:[torch.cuda.LongTensor of size 256x57 (GPU 0)]
	[.neg2]:[torch.cuda.LongTensor of size 256x66 (GPU 0)]
	[.neg3]:[torch.cuda.LongTensor of size 256x56 (GPU 0)]
	[.neg4]:[torch.cuda.LongTensor of size 256x53 (GPU 0)]

## Neural Network

just use sentence embedding as an average over word embeddings **[5 points]**:  
$z_s = \frac 1 n \sum_{i=1}^n e_{w_i}, z_s \in R^d$ sentence embedding  
$e_{w_i} \in R^d$, token embedding of size d  
$n$ - number of tokens in a sentence  
(implemented with ```nn.EmbeddingBag```)

$p_t = softmax(W z_s + b), p_t \in R^K$ topic weights for sentence $s$, with trainable matrix $W \in R^{dxK}$ and bias vector $b \in R^K$  
$r_s = T^T p_t, r_s \in R^d$ reconstructed sentence embedding as a weighted sum of topic embeddings   
$T \in R^{Kxd}$ trainable matrix of topic embeddings, K=number of topics


In [28]:
pad_id = TEXT.vocab.stoi['<pad>']
pad_id

1

In [0]:
class MyModel(nn.Module):
    
    def __init__(self, vocab_size, emb_dim=300, topic_dim=5):
      super(MyModel, self).__init__()
      self.embedding = nn.EmbeddingBag(vocab_size, emb_dim)  ## how do I ignore the padding?
      self.pt = nn.Linear(emb_dim, topic_dim)
      self.soft = F.softmax
      self.rs = nn.Linear(topic_dim, emb_dim, bias=False)

    def forward(self, batch):
      emb_x = self.embedding(batch.pos)
      x = self.pt(emb_x)
      x = self.soft(x)
      x = self.rs(x) 
      
      negs = [self.embedding(batch.neg0), 
              self.embedding(batch.neg1), 
              self.embedding(batch.neg2),
              self.embedding(batch.neg3),
              self.embedding(batch.neg4),]  ## how do I generalize this to different num_neg_samples?
      negs = torch.stack(negs, dim=-1)

      return vecs_rec, vecs_true, negs

In [38]:
model = MyModel(vocab_size)
model = model.to(DEVICE)
x, emb_x, negs = model(example_batch)
x.shape

  del sys.path[0]


torch.Size([256, 300])

In [31]:
emb_x.shape

torch.Size([256, 300])

In [32]:
negs.shape

torch.Size([256, 300, 5])

In [33]:
list(model.parameters())[0].shape

torch.Size([95799, 300])

In [34]:
model.embedding.weight.shape

torch.Size([95799, 300])

## Loss
**Training objective**:

$$ J = \sum_{s \in D} \sum_{i=1}^m max(0, 1-r_s^T z_s + r_s^T n_i) + \lambda ||T^T T - I ||^2_F  $$
where   
$m$ random sentences are sampled as negative examples from dataset $D$ for each sentence $s$  
$n_i = \frac 1 n \sum_{i=j}^n e_{w_j}$ average of word embeddings in the i-th sentence  
$||T^T T - I ||_F$ regularizer, that enforces matrix $T$ to be orthogonal  
$||A||^2_F = \sum_{i=1}^N\sum_{j=1}^M a_{ij}^2, A \in R^{NxM}$ Frobenius norm

In [0]:
class MyLoss(nn.Module):
  def __init__(self, lmbd=0.01):
    super(MyLoss, self).__init__()  
    self.lmbd = lmbd

  def forward(self, vecs_true, negs, vecs_rec, T):
    vecs_true = vecs_true.unsqueeze(1) ## add dimension for bmm
    rs = vecs_rec.unsqueeze(1) ## add dimension for bmm
    rsT = rs.permute(0, 2, 1) ## transpose
    rsTzs = torch.bmm(rsT, vecs_true)
    negs_losses = []
    for ni in negs.permute(2, 0, 1):  ## so that we iterate over the neg samples
      ni = ni.unsqueeze(1) ## add dimension for bmm
      negs_losses.append(torch.bmm(rsT, ni))
    losses = []
    for n_loss in negs_losses:
      tmp = (1 - rsTzs + n_loss).squeeze(1)
      zeros = torch.zeros_like(tmp).to(DEVICE)
      values, idx = torch.max(torch.stack([tmp, zeros]), 0)
      losses.append(values)
    losses = torch.stack(losses, dim=-1)
    reg_0 = torch.mm(T.permute(1,0), T)
    reg = self.lmbd  * (torch.norm(reg_0 - torch.eye(reg_0.shape[0]).to(DEVICE), p='fro')) #
    return torch.sum(losses) + reg

In [42]:
criterion = MyLoss()
criterion.to(DEVICE)
criterion(emb_x, negs, x, model.embedding.weight)

tensor(1.1522e+08, device='cuda:0', grad_fn=<AddBackward0>)

## Train

**TODO CHANGE THE EPOCH TRAIN TO THE NEW ARCHITECTURE**

In [0]:
num_epochs = 10
optimizer = torch.optim.Adam(model.parameters())

In [0]:
def train_epoch(data_iter, len_iter, n_epoch, model, criterion, optimizer=None):
    train_losses = []
    total_loss = 0
    data_iter = tqdm_notebook(data_iter, total=len_iter, desc=f"Epoch {n_epoch + 1}", leave=True)
    counter = 0
    for batch in data_iter:
        if optimizer:
          optimizer.zero_grad()
        vecs_rec, vecs_true, negs = model.forward(batch)
        loss = criterion(vecs_true, negs, vecs_rec, model.embedding.weight)
        loss.backward()
        if optimizer:
          optimizer.step()
        loss_value = loss.detach().item()
        total_loss += loss_value
        train_losses.append(loss_value)
        data_iter.set_postfix(loss = loss_value)
        counter += 1
        
    total_loss /= counter
    return total_loss, train_losses

In [0]:
total_train_losses = []
total_valid_losses = []
for epoch in range(num_epochs):
    model.train()
    loss, train_losses = train_epoch(trn_itr, len(trn_itr), epoch, model, criterion, optimizer)
    total_train_losses += train_losses
    print('train', loss)

Please use `tqdm.notebook.tqdm` instead of `tqdm.tqdm_notebook`
  after removing the cwd from sys.path.


HBox(children=(IntProgress(value=0, description='Epoch 1', max=717, style=ProgressStyle(description_width='ini…

  del sys.path[0]



train 109404639.0906555


HBox(children=(IntProgress(value=0, description='Epoch 2', max=717, style=ProgressStyle(description_width='ini…

In [0]:
smooth = lambda y: gaussian_filter1d(y, sigma=10)

plt.figure(figsize=(14, 10))
plt.plot(range(len(total_train_losses)), smooth(total_train_losses), np.array(range(len(total_valid_losses)))*(len(total_train_losses)/len(total_valid_losses)), smooth(total_valid_losses))
plt.legend(('train loss', 'valid loss by batch'),
           loc='center', prop={'size': 18})
plt.title('Smoothed training process', fontsize=20)
plt.xlabel('Iterations', fontsize=16)
plt.ylabel('Loss function (smoothed)', fontsize=16)
plt.show()

## Topic Coherence

In [0]:
wv = api.load('word2vec-google-news-300')
wv['king'][:10]

In [0]:
emb_dim = wv['king'].size