# Assignment 9

Use data from https://github.com/thedenaas/hse_seminars/tree/master/2018/seminar_13/data.zip
Implement the model in PyTorch from "An Unsupervised Neural Attention Model for Aspect Extraction" (He et al., 2017), also described in the seminar notes.

You can use sentence embeddings with attention **[7 points]**:

$z_s = \sum_{i=1}^n \alpha_i e_{w_i}, z_s \in R^d$ sentence embedding

$\alpha_i = softmax(d_i)$ attention weight for i-th token

$d_i = e_{w_i}^T M y_s$ attention with trainable matrix $M \in R^{d \times d}$

$y_s = \frac 1 n \sum_{i=1}^n e_{w_i}, y_s \in R^d$ sentence context

$e_{w_i} \in R^d$, token embedding of size d

$n$ - number of tokens in a sentence
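
The attention formulas above can be sketched as a small PyTorch module. This is only an illustration under assumptions: the class name `AttentionSentenceEmbedding`, the single-sentence `(n, d)` input layout, and the small random initialisation of $M$ are not taken from the assignment or the paper.

```python
import torch
import torch.nn as nn


class AttentionSentenceEmbedding(nn.Module):
    """Computes z_s = sum_i alpha_i * e_{w_i} for a single sentence."""

    def __init__(self, emb_dim):
        super().__init__()
        # Trainable attention matrix M (d x d); the small random init is an assumption.
        self.M = nn.Parameter(torch.randn(emb_dim, emb_dim) * 0.01)

    def forward(self, word_embeddings):
        # word_embeddings: (n, d) tensor holding e_{w_1}, ..., e_{w_n}
        y_s = word_embeddings.mean(dim=0)      # (d,) sentence context
        d = word_embeddings @ self.M @ y_s     # (n,) scores d_i = e_{w_i}^T M y_s
        alpha = torch.softmax(d, dim=0)        # (n,) attention weights, sum to 1
        z_s = alpha @ word_embeddings          # (d,) weighted sum of token embeddings
        return z_s, alpha
```

With $M = 0$ every score $d_i$ is zero, the weights become uniform, and $z_s$ reduces to the plain average used in the simpler variant.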

**Or** just use sentence embedding as an average over word embeddings **[5 points]**:

$z_s = \frac 1 n \sum_{i=1}^n e_{w_i}, z_s \in R^d$ sentence embedding

$e_{w_i} \in R^d$, token embedding of size d

$n$ - number of tokens in a sentence

$p_t = softmax(W z_s + b), p_t \in R^K$ topic weights for sentence $s$, with trainable matrix $W \in R^{K \times d}$ and bias vector $b \in R^K$

$r_s = T^T p_t, r_s \in R^d$ reconstructed sentence embedding as a weighted sum of topic embeddings

$T \in R^{K \times d}$ trainable matrix of topic embeddings, $K$ = number of topics

**Training objective:** $$ J = \sum_{s \in D} \sum_{i=1}^m \max(0, 1 - r_s^T z_s + r_s^T n_i) + \lambda ||T^T T - I||^2_F $$ where
$m$ random sentences are sampled as negative examples from dataset $D$ for each sentence $s$

$n_i = \frac 1 n \sum_{j=1}^n e_{w_j}$ average of word embeddings in the $i$-th negative sentence (here $n$ is the length of that sentence)

$||T^T T - I ||_F$ regularizer, that enforces matrix $T$ to be orthogonal

$||A||^2_F = \sum_{i=1}^N\sum_{j=1}^M a_{ij}^2, A \in R^{N \times M}$ squared Frobenius norm
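
As a worked check of the objective, the two terms can be computed for one sentence with NumPy. This is a sketch on toy inputs, not the required PyTorch implementation: the function names `hinge_term` and `orthogonality_penalty` are hypothetical, and $W$ is stored with shape `(K, d)` so that `W @ z_s` yields a $K$-vector.

```python
import numpy as np


def softmax(x):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(x - x.max())
    return e / e.sum()


def hinge_term(z_s, negatives, W, b, T):
    """Sum over the m negatives of max(0, 1 - r_s^T z_s + r_s^T n_i)."""
    p_t = softmax(W @ z_s + b)   # (K,) topic weights
    r_s = T.T @ p_t              # (d,) reconstructed sentence embedding
    return np.maximum(0.0, 1.0 - r_s @ z_s + negatives @ r_s).sum()


def orthogonality_penalty(T):
    """||T^T T - I||_F^2, written out as a sum of squared entries."""
    gram = T.T @ T
    return np.sum((gram - np.eye(gram.shape[0])) ** 2)
```

For a dataset, the hinge terms would be summed over all sentences and $\lambda$ times the penalty added once, following the formula for $J$.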

**[3 points]** Compute topic coherence for at least 3 different numbers of topics, using the 10 nearest words for each topic. This means you have to train one model per number of topics. You can use the code from the seminar notes with word2vec similarity scores.
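
One possible way to obtain the 10 nearest words per topic is to rank the vocabulary by cosine similarity between each row of $T$ and the word embeddings. The sketch below assumes that `T` and `embedding_matrix` are NumPy arrays (for example a trained parameter converted with `.detach().numpy()`) and that `index_to_word` maps embedding rows to words. All three names, and the function `topic_top_words`, are hypothetical.

```python
import numpy as np


def topic_top_words(T, embedding_matrix, index_to_word, top_n=10):
    """Return, for each topic, the top_n words closest to its embedding."""
    # Normalise rows so that the dot product equals cosine similarity;
    # the floor avoids division by zero for all-zero rows such as padding.
    T_norm = T / np.maximum(np.linalg.norm(T, axis=1, keepdims=True), 1e-12)
    E_norm = embedding_matrix / np.maximum(
        np.linalg.norm(embedding_matrix, axis=1, keepdims=True), 1e-12)
    sims = T_norm @ E_norm.T                       # (K, V) cosine similarities
    top = np.argsort(-sims, axis=1)[:, :top_n]     # indices of the closest words
    return [[index_to_word[j] for j in row] for row in top]
```

The returned list of word lists has the same shape as the `term_rankings` argument expected by a pairwise word2vec coherence function.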

### Importing data

In [1]:
# The tree/ URL returns GitHub's HTML page rather than the archive, so fetch the file through raw/ instead.
!wget https://github.com/thedenaas/hse_seminars/raw/master/2018/seminar_13/data.zip



In [3]:
!unzip data.zip

Archive:  data.zip
  inflating: data.txt                
  inflating: stopwords.txt           


In [0]:
with open('data.txt', 'r', encoding = 'utf-8') as f:
    content = f.readlines()

In [5]:
len(content), type(content)

(4551, list)

In [6]:
content[0]

"Barclays' defiance of US fines has merit Barclays disgraced itself in many ways during the pre-financial crisis boom years. So it is tempting to think the bank, when asked by US Department of Justice to pay a large bill for polluting the financial system with mortgage junk between 2005 and 2007, should cough up, apologise and learn some humility. That is not the view of the chief executive, Jes Staley. Barclays thinks the DoJ’s claims are “disconnected from the facts” and that it has “an obligation to our shareholders, customers, clients and employees to defend ourselves against unreasonable allegations and demands.” The stance is possibly foolhardy, since going into open legal battle with the most powerful US prosecutor is risky, especially if you end up losing. But actually, some grudging respect for Staley and Barclays is in order. The US system for dishing out fines to errant banks for their mortgage sins has come to resemble a casino. The approach prefers settlements behind close

In [7]:
content[1]

"How big is Hillary Clinton's lead in the presidential race? It depends on the poll Democratic candidate Hillary Clinton now has an 11-percentage-point lead over her Republican opponent Donald Trump, according to a poll released by PRRI and the Atlantic on Tuesday. If that weren’t already reason enough for Trump supporters to worry, a poll from NBC and the Wall Street Journal released on Monday put Clinton’s lead at 14 percentage points. But why the difference in numbers? If you want to follow polls in the 28 remaining days before the US votes, I strongly recommend you ignore the date that the poll was published – and focus instead on the dates that the poll was conducted. That PRRI/Atlantic poll was based on landline and cellphone interviews that took place on 5-9 October while the data for the NBC/WSJ poll was gathered on 8-9 October. Those dates are potentially significant given that on 8 October, a 2005 recording was released of Trump saying that, thanks to his fame, he was able to

In [0]:
with open('stopwords.txt', 'r', encoding = 'utf-8') as f:
    stopwords = f.readlines()

In [9]:
len(stopwords), type(stopwords)

(350, list)

### Preparing data

In [0]:
stopwords = set([x.strip() for x in stopwords])

In [0]:
import re

In [0]:
data = [''] * len(content)
for i, line in enumerate(content):
    new_line = []
    for t in line.split():
        if t not in stopwords:
            new_line.append(t)
    new_line = ' '.join(new_line)
    data[i] = new_line
# TODO:
# 1. split the text into sentences
# 2. for each sentence, sample m random sentences as negative examples (try m = 5)
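
The two TODO steps could look roughly like the sketch below. It assumes a naive punctuation-based sentence splitter and uniform sampling without replacement. The helper names `split_sentences` and `sample_negatives` are hypothetical, and a proper tokenizer may split news text more reliably.

```python
import random
import re


def split_sentences(text):
    # Naive rule: break at whitespace that follows '.', '!' or '?'.
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]


def sample_negatives(sentences, index, m=5, rng=random):
    # Draw up to m sentences uniformly, excluding the sentence at `index`.
    candidates = [i for i in range(len(sentences)) if i != index]
    chosen = rng.sample(candidates, min(m, len(candidates)))
    return [sentences[i] for i in chosen]
```

Building the candidate list is linear in the corpus size on every call, so for a large corpus it may be preferable to draw random indices and redraw on a collision.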

In [13]:
data[0]

"Barclays' defiance US fines merit Barclays disgraced ways pre-financial crisis boom years. So tempting think bank, asked US Department Justice pay large bill polluting financial system mortgage junk 2005 2007, cough up, apologise learn humility. That view chief executive, Jes Staley. Barclays thinks DoJ’s claims “disconnected facts” “an obligation shareholders, customers, clients employees defend against unreasonable allegations demands.” The stance possibly foolhardy, going open legal battle powerful US prosecutor risky, especially losing. But actually, grudging respect Staley Barclays order. The US system dishing fines errant banks mortgage sins resemble casino. The approach prefers settlements behind closed doors difference size penalties explained. Occasional leaks negotiating demands methodology appear arbitrary. Deutsche Bank initially asked $14bn (£11.5bn), reached settlement $7.2bn Thursday. Where rhyme reason? There strong suspicion roulette wheel weighted against Europeans. 

### Preparing model

In [16]:
!wget -c "https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz"

--2020-03-22 18:22:59--  https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.144.237
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.144.237|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1647046227 (1.5G) [application/x-gzip]
Saving to: ‘GoogleNews-vectors-negative300.bin.gz’


2020-03-22 18:23:17 (86.9 MB/s) - ‘GoogleNews-vectors-negative300.bin.gz’ saved [1647046227/1647046227]
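
To feed these vectors into `nn.Embedding.from_pretrained`, they first need to be arranged into a matrix indexed by token id. The sketch below assumes `vectors` is any mapping from word to vector that supports `in` and `[]`. An object returned by gensim's `KeyedVectors.load_word2vec_format(..., binary=True)` should behave this way, but that loading step is not shown or tested here. The function name `build_embedding_matrix` and the `<pad>` row are hypothetical choices.

```python
import numpy as np


def build_embedding_matrix(tokenized_sentences, vectors, dim=300):
    """Map every known token to a row index and stack the vectors into a matrix."""
    word_to_index = {"<pad>": 0}
    rows = [np.zeros(dim, dtype=np.float32)]   # row 0 is reserved for padding
    for sentence in tokenized_sentences:
        for token in sentence:
            # Tokens without a pretrained vector are skipped.
            if token not in word_to_index and token in vectors:
                word_to_index[token] = len(rows)
                rows.append(np.asarray(vectors[token], dtype=np.float32))
    return word_to_index, np.vstack(rows)
```

Dropping out-of-vocabulary tokens keeps the matrix small. Mapping them to a shared unknown-token row would be an alternative.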



In [0]:
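import torch
import torch.nn as nn


def reconstruction_loss(r_s, z_s, n, T, lambda_=1.0):
    # `lambda` is a reserved word in Python, hence `lambda_`.
    # Hinge term summed over the m negatives in n (m x d): max(0, 1 - r_s^T z_s + r_s^T n_i)
    hinge = torch.clamp(1 - r_s @ z_s + n @ r_s, min=0).sum()
    # Orthogonality penalty ||T^T T - I||_F^2 (torch.norm defaults to the Frobenius norm)
    gram = T.t() @ T
    penalty = torch.norm(gram - torch.eye(gram.size(0), device=T.device)) ** 2
    return hinge + lambda_ * penalty


class ABAE(nn.Module):
    def __init__(self, embedding_matrix, n_topics, lambda_=1.0):
        super().__init__()
        self.embedding = nn.Embedding.from_pretrained(torch.FloatTensor(embedding_matrix))
        emb_dim = self.embedding.embedding_dim
        self.fc = nn.Linear(emb_dim, n_topics)  # W and b
        # Topic embeddings T (K x d); the scaled random init is an arbitrary choice
        self.T = nn.Parameter(torch.randn(n_topics, emb_dim) * 0.1)
        self.lambda_ = lambda_

    def forward(self, batch):
        # Assumed batch layout: sentence is a LongTensor of token ids (n,),
        # neg_sentences is a LongTensor (m, n_neg) of padded negative sentences
        sentence, neg_sentences = batch
        word_embeddings = self.embedding(sentence)     # (n, d) sequence of vectors
        z_s = word_embeddings.mean(dim=0)              # (d,) averaged sentence embedding
        p_t = torch.softmax(self.fc(z_s), dim=0)       # (K,) topic weights
        r_s = self.T.t() @ p_t                         # (d,) reconstructed embedding
        n = self.embedding(neg_sentences).mean(dim=1)  # (m, d) negative sentence embeddings
        return reconstruction_loss(r_s, z_s, n, self.T, self.lambda_)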

In [0]:
import torch

In [0]:
from tqdm.auto import tqdm


def _train_epoch(model, iterator, optimizer, curr_epoch):

    model.train()

    running_loss = 0

    n_batches = len(iterator)
    iterator = tqdm(iterator, total=n_batches, desc='epoch %d' % (curr_epoch), leave=True)

    for i, batch in enumerate(iterator):
        optimizer.zero_grad()

        loss = model(batch)
        loss.backward()
        optimizer.step()

        curr_loss = loss.item()

        # Running mean of the batch losses seen so far
        loss_smoothing = i / (i+1)
        running_loss = loss_smoothing * running_loss + (1 - loss_smoothing) * curr_loss

        iterator.set_postfix(loss='%.5f' % running_loss)

    return running_loss

def _test_epoch(model, iterator):
    model.eval()
    epoch_loss = 0

    n_batches = len(iterator)
    with torch.no_grad():
        for batch in iterator:
            loss = model(batch)
            epoch_loss += loss.item()

    return epoch_loss / n_batches

### Topic Coherence

In [0]:
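from sklearn.decomposition import TruncatedSVD
from tqdm.auto import tqdm

# A is the document-term matrix (e.g. TF-IDF) built as in the seminar notes; it is not defined in this notebook.
kmin, kmax = 4, 15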

topic_models = []
for k in tqdm(range(kmin,kmax+1)):
    model = TruncatedSVD(n_components=k) 
    W = model.fit_transform(A)
    H = model.components_    
    topic_models.append((k,W,H))

In [0]:
import gensim

w2v_model = gensim.models.Word2Vec.load("w2v-model.bin")
print("Model has %d terms" % len(w2v_model.wv.vocab))

In [0]:
from itertools import combinations


def calculate_coherence(w2v_model, term_rankings):
    overall_coherence = 0.0
    for topic_index in range(len(term_rankings)):
        # check each pair of terms
        pair_scores = []
        for pair in combinations(term_rankings[topic_index], 2):
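            # Query the KeyedVectors stored on the model; calling similarity on the model itself is deprecated
            pair_scores.append(w2v_model.wv.similarity(pair[0], pair[1]))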
        # get the mean for all pairs in this topic
        topic_score = sum(pair_scores) / len(pair_scores)
        overall_coherence += topic_score
    # get the mean score across all topics
    return overall_coherence / len(term_rankings)


k_values = []
coherences = []
for (k,W,H) in topic_models:
    # Get all of the topic descriptors - the term_rankings, based on top 10 terms
    term_rankings = []
    for topic_index in range(k):
        term_rankings.append(get_descriptor(terms, H, topic_index, 10))
    # Now calculate the coherence based on our Word2vec model
    k_values.append(k)
    coherences.append(calculate_coherence(w2v_model, term_rankings))
    print("K=%02d: Coherence=%.4f" % (k, coherences[-1]))
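
The cell above relies on `get_descriptor` and `terms`, which come from the seminar notes and are not defined in this notebook. A minimal sketch of such a helper follows. It assumes that `H` is a topics × terms weight matrix and that `all_terms` lists the vocabulary in the same column order; the seminar's actual version may differ.

```python
import numpy as np


def get_descriptor(all_terms, H, topic_index, top):
    # Sort term indices by descending weight within the chosen topic row.
    top_indices = np.argsort(H[topic_index, :])[::-1][:top]
    return [all_terms[i] for i in top_indices]
```

For the SVD baseline, `H` would be `model.components_` and `terms` the vocabulary of the vectorizer that produced `A`.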

In [0]:
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(13,7))
# create the line plot
ax = plt.plot(k_values, coherences)
plt.xticks(k_values)
plt.xlabel("Number of Topics")
plt.ylabel("Mean Coherence")
# add the points
plt.scatter( k_values, coherences, s=120)
# find and annotate the maximum point on the plot
ymax = max(coherences)
xpos = coherences.index(ymax)
best_k = k_values[xpos]
plt.annotate( "k=%d" % best_k, xy=(best_k, ymax), xytext=(best_k, ymax), textcoords="offset points", fontsize=16)
# show the plot
plt.show()

In [0]:
k = best_k
# get the model that we generated earlier.
W = topic_models[k-kmin][1]
H = topic_models[k-kmin][2]

for topic_index in range(k):
    descriptor = get_descriptor(terms, H, topic_index, 10)
    str_descriptor = ", ".join(descriptor)
    print("Topic %02d: %s" % ( topic_index+1, str_descriptor ))