# RAAI Summer School 2019

This tutorial was prepared by Ivan Fursov at Tinkoff.

Telegram: [@fursov](https://tele.click/fursov)

# Paraphrase identification

**Download** files from [here](https://yadi.sk/d/hvxpunMTd2xj2g)

Task: given a pair of sentences, classify the pair as paraphrases or not paraphrases.

Dataset: [Quora Question Pairs](https://www.kaggle.com/quora/question-pairs-dataset)

Quora's first public dataset is related to the problem of identifying duplicate questions. At Quora, an important product principle is that there should be a single question page for each logically distinct question. For example, the queries “What is the most populous state in the USA?” and “Which state in the United States has the most people?” should not exist separately on Quora because the intent behind both is identical. 

In [None]:
import re
import os
import numpy as np
import pandas as pd

from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import normalize

pd.set_option('max_colwidth', 80)

In [None]:
def clean_string(string):
    string = re.sub(r"[^A-Za-z ]", " ", string)  
    return string.strip().lower()

In [None]:
data = pd.read_csv('data/questions.csv', nrows=30000)
data = data.dropna()

data['question1'] = data['question1'].apply(clean_string)
data['question2'] = data['question2'].apply(clean_string)

data = data[['question1', 'question2', 'is_duplicate']]
data.columns = ['text1', 'text2', 'labels']

data = data[data['text1'].apply(lambda x: len(x) > 0) & (data['text2'].apply(lambda x: len(x) > 0))]

In [None]:
data.sample(5)

### Train/Dev/Test

In [None]:
# train/dev/test -> 70/15/15
data_splits = ('train', 'dev', 'test')

train, intermediate = train_test_split(data, test_size=0.3, random_state=24)
dev, test = train_test_split(intermediate, test_size=0.5, random_state=24)

# Baseline approaches


## Text representations
### Bag-of-Words

Bag of Words (BoW) is an algorithm that counts how many times a word appears in a document. Those word counts allow us to compare documents and gauge their similarities for applications like search, document classification and topic modeling.
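To make the counting concrete, here is a minimal sketch in plain Python. The helper name `bag_of_words` is hypothetical, and the tokenization is a simple whitespace split, so the result will differ slightly from `CountVectorizer`, whose default token pattern drops single-character tokens.

```python
from collections import Counter


def bag_of_words(texts):
    """Return (vocabulary, count vectors) for whitespace-tokenized texts."""
    # the vocabulary is the sorted set of all tokens seen in the texts
    vocabulary = sorted({token for text in texts for token in text.split()})
    vectors = []
    for text in texts:
        counts = Counter(text.split())
        # one count per vocabulary entry, zero if the token is absent
        vectors.append([counts[token] for token in vocabulary])
    return vocabulary, vectors


vocabulary, vectors = bag_of_words([
    "what is the most populous state in the usa",
    "which state in the united states has the most people",
])

for token, count_a, count_b in zip(vocabulary, *vectors):
    print(f"{token:>10}  {count_a}  {count_b}")
```

Both questions end up as vectors over the same vocabulary, which is what makes them directly comparable.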

In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from collections import defaultdict

import scipy
from scipy.sparse import csr_matrix

$$
\text{similarity}=\cos(\theta)=\frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\|\|\mathbf{B}\|}=\frac{\sum\limits_{i=1}^{n} A_{i} B_{i}}{\sqrt{\sum\limits_{i=1}^{n} A_{i}^{2}} \sqrt{\sum\limits_{i=1}^{n} B_{i}^{2}}}
$$
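The formula can be checked by hand on two short vectors. The sketch below uses plain Python lists and hypothetical function names. Returning a similarity of 0 for an all-zero vector is an assumption of this sketch, chosen so that such a pair gets the distance 1.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # assumption of this sketch: an empty vector is similar to nothing
        return 0.0
    return dot / (norm_a * norm_b)


def cosine_distance(a, b):
    # the notebook works with distances, i.e. 1 - similarity
    return 1.0 - cosine_similarity(a, b)


a = [1, 1, 0, 2]
b = [1, 0, 1, 2]

# dot product = 5, both norms = sqrt(6), so the similarity is 5 / 6
print(cosine_similarity(a, b), cosine_distance(a, b))
```

For count vectors, which have no negative entries, the distance stays between 0 and 1. Dense embeddings can have negative components, so there it can reach 2.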

In [None]:
def calculate_cosine_distance(textA, textB):
    textA = normalize(textA)
    textB = normalize(textB)
    if isinstance(textA, np.ndarray):
        dot_product = np.multiply(textA, textB).sum(axis=1).flatten()
    else:
        dot_product = np.array(textA.multiply(textB).sum(axis=1)).flatten()
    return 1 - dot_product

In [None]:
bow = CountVectorizer()
bow.fit(train['text1'].tolist() + train['text2'].tolist())

bow_data = {name: dict() for name in data_splits}

for d, name in zip((train, dev, test), data_splits):
    bow_data[name]['text1'] = bow.transform(d['text1'].tolist())
    bow_data[name]['text2'] = bow.transform(d['text2'].tolist())

In [None]:
bow_data['train']['text1']

In [None]:
bow_data['train']['text2']

In [None]:
def calculate_score(textA, textB, labels):
    cos_dists = calculate_cosine_distance(textA, textB)

    best_f1 = 0
    best_thres = None

    for thres in np.linspace(0, 2, num=50):
        # f1_score expects (y_true, y_pred)
        f1 = f1_score(labels, (cos_dists < thres).astype(np.int32))
        if f1 > best_f1:
            best_f1 = f1
            best_thres = thres
            
    return best_f1, best_thres

In [None]:
best_f1, best_thres = calculate_score(
    bow_data['dev']['text1'], 
    bow_data['dev']['text2'],
    dev['labels'].values
)

print(f'(DEV) F1 score = {best_f1}')

In [None]:
test_cos_dists = calculate_cosine_distance(bow_data['test']['text1'], bow_data['test']['text2'])

test_f1 = f1_score(test['labels'].values, (test_cos_dists < best_thres).astype(np.int32))
print(f'(TEST) F1 score = {test_f1}')

### Tf-Idf

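Term frequency-inverse document frequency (TF-IDF) is another way to represent a text by the words it contains. Instead of a raw count, each word gets a weight that grows with its frequency in the document and shrinks with the number of documents that contain it, so TF-IDF measures relevance rather than frequency. In other words, word counts are replaced with TF-IDF scores computed across the whole dataset.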

<img src="https://skymind.ai/images/wiki/tfidf.png">
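A small worked example shows how the weights behave. This is a sketch with a hypothetical helper `tfidf_vectors`. It uses the smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which is what `TfidfVectorizer` does with its default settings, but the whitespace tokenization here is simpler than the library's.

```python
import math
from collections import Counter


def tfidf_vectors(texts):
    docs = [text.split() for text in texts]
    vocabulary = sorted({token for doc in docs for token in doc})
    n_docs = len(docs)

    # document frequency: in how many documents each token occurs
    df = Counter(token for doc in docs for token in set(doc))
    # smoothed idf: rare tokens get a larger weight
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocabulary}

    vectors = []
    for doc in docs:
        tf = Counter(doc)
        weights = [tf[t] * idf[t] for t in vocabulary]
        # L2-normalize each document vector
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        vectors.append([w / norm for w in weights])
    return vocabulary, idf, vectors


texts = [
    "how do i learn python",
    "how do i learn to cook",
    "what is the capital of france",
]
vocabulary, idf, vectors = tfidf_vectors(texts)

# "how" occurs in two documents, "python" in one, so "python" weighs more
print(round(idf["how"], 4), round(idf["python"], 4))
```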

In [None]:
tfidf = TfidfVectorizer()
tfidf.fit(train['text1'].tolist() + train['text2'].tolist())

tfidf_data = {name: dict() for name in data_splits}

for d, name in zip((train, dev, test), data_splits):
    tfidf_data[name]['text1'] = tfidf.transform(d['text1'].tolist())
    tfidf_data[name]['text2'] = tfidf.transform(d['text2'].tolist())

In [None]:
tfidf_data['train']['text1']

In [None]:
best_f1, best_thres = calculate_score(
    tfidf_data['dev']['text1'], 
    tfidf_data['dev']['text2'],
    dev['labels'].values
)

print(f'(DEV) F1 score = {best_f1}')

In [None]:
test_cos_dists = calculate_cosine_distance(tfidf_data['test']['text1'], tfidf_data['test']['text2'])
test_f1 = f1_score(test['labels'].values, (test_cos_dists < best_thres).astype(np.int32))
print(f'(TEST) F1 score = {test_f1}')

### Tf-Idf on char n-grams

Character n-grams are very helpful for morphologically rich languages such as Russian, where the same word appears in many inflected forms.
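The effect is easy to see on two forms of the same stem. The sketch below is plain Python with hypothetical helpers (`char_ngrams`, `jaccard`). It only approximates what `TfidfVectorizer(analyzer='char', ngram_range=(3, 5))` extracts, and it compares sets of n-grams instead of TF-IDF weights.

```python
def char_ngrams(text, n_min=3, n_max=5):
    """All character n-grams of the text for n_min <= n <= n_max."""
    return {
        text[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(text) - n + 1)
    }


def jaccard(a, b):
    """Share of common elements between two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# as whole words the two forms have nothing in common ...
word_level = jaccard({"investing"}, {"investments"})
# ... but many of their character n-grams coincide
char_level = jaccard(char_ngrams("investing"), char_ngrams("investments"))

print(word_level, round(char_level, 3))
```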

In [None]:
tfidf = TfidfVectorizer(analyzer='char', ngram_range=(3, 5))
tfidf.fit(train['text1'].tolist() + train['text2'].tolist())

tfidf_data = {name: dict() for name in data_splits}

for d, name in zip((train, dev, test), data_splits):
    tfidf_data[name]['text1'] = tfidf.transform(d['text1'].tolist())
    tfidf_data[name]['text2'] = tfidf.transform(d['text2'].tolist())

In [None]:
tfidf_data['train']['text1']

In [None]:
best_f1, best_thres = calculate_score(
    tfidf_data['dev']['text1'], 
    tfidf_data['dev']['text2'],
    dev['labels'].values
)

print(f'(DEV) F1 score = {best_f1}')

In [None]:
test_cos_dists = calculate_cosine_distance(tfidf_data['test']['text1'], tfidf_data['test']['text2'])
test_f1 = f1_score(test['labels'].values, (test_cos_dists < best_thres).astype(np.int32))
print(f'(TEST) F1 score = {test_f1}')

## Neural Approaches -- fastText

The gist of fastText is that instead of directly learning a vector representation for a word (as with word2vec), we learn a representation for each character n-gram. Each word is represented as a bag of character n-grams, so the word embedding is the sum of the vectors of its character n-grams. This also yields vectors for words that never occurred in the training data.

fastText is a library whose purpose is to be used as a fast baseline for text embeddings/classification when deep learning approaches are just too slow and expensive.
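The subword idea described above can be sketched in a few lines. Everything here is illustrative: `ToySubwordEmbeddings` is a hypothetical class with random, untrained vectors. The real library additionally learns a vector for the whole word and hashes n-grams into a fixed number of buckets. The n-gram range 3 to 6 is assumed to match the library defaults.

```python
import random


def word_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of a word wrapped in boundary markers < and >."""
    marked = f"<{word}>"
    return [
        marked[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(marked) - n + 1)
    ]


class ToySubwordEmbeddings:
    """Hypothetical stand-in for a trained model: random n-gram vectors."""

    def __init__(self, dim=4, seed=0):
        self.dim = dim
        self.rng = random.Random(seed)
        self.table = {}

    def ngram_vector(self, ngram):
        # a trained model would look the vector up, here it is created lazily
        if ngram not in self.table:
            self.table[ngram] = [self.rng.uniform(-1, 1) for _ in range(self.dim)]
        return self.table[ngram]

    def word_vector(self, word):
        # the word vector is the sum of the vectors of its n-grams
        vector = [0.0] * self.dim
        for ngram in word_ngrams(word):
            for i, value in enumerate(self.ngram_vector(ngram)):
                vector[i] += value
        return vector


emb = ToySubwordEmbeddings()
print(word_ngrams("where", 3, 3))
print(emb.word_vector("whereabouts"))
```

Because any string can be split into n-grams, `word_vector` returns a vector even for a word that was never seen before.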

In [None]:
from gensim.models import FastText

In [None]:
# training corpus: both question columns, tokenized by whitespace

texts = train['text1'].tolist() + train['text2'].tolist()
texts = [text.split() for text in texts]

In [None]:
%%time

model = FastText(texts, size=300)

In [None]:
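def text2vec(text, model=model):
    assert len(text) > 0

    # accept both a full model (vectors live in `.wv`) and bare KeyedVectors
    keyed_vectors = getattr(model, 'wv', model)

    vectors = []
    for word in text.split():
        try:
            vectors.append(keyed_vectors[word])
        except KeyError:
            # out-of-vocabulary word: fall back to a zero vector
            vectors.append(np.zeros(keyed_vectors.vector_size))

    # the text vector is the average of its word vectors
    return np.mean(vectors, axis=0)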

In [None]:
fasttext_data = {name: dict() for name in data_splits}

for d, name in zip((train, dev, test), data_splits):
    fasttext_data[name]['text1'] = np.array([text2vec(t) for t in d['text1'].tolist()])
    fasttext_data[name]['text2'] = np.array([text2vec(t) for t in d['text2'].tolist()])

In [None]:
fasttext_data['train']['text1'].shape

In [None]:
best_f1, best_thres = calculate_score(
    fasttext_data['dev']['text1'], 
    fasttext_data['dev']['text2'],
    dev['labels'].values
)

print(f'(DEV) F1 score = {best_f1}')

In [None]:
test_cos_dists = calculate_cosine_distance(fasttext_data['test']['text1'], fasttext_data['test']['text2'])
test_f1 = f1_score(test['labels'].values, (test_cos_dists < best_thres).astype(np.int32))
print(f'(TEST) F1 score = {test_f1}')

## Pre-trained fasttext

Learning word representations requires serious computational power and time. Since Facebook has already done it for you, why not use the result to boost productivity?

In [None]:
from gensim.models import KeyedVectors

In [None]:
# uncomment if you'd like to download (2.5Gb+)

# !wget https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M.vec.zip
# !unzip wiki-news-300d-1M.vec.zip -d data

In [None]:
wv_from_text = KeyedVectors.load_word2vec_format('data/wiki-news-300d-1M.vec')

In [None]:
fasttext_data = {name: dict() for name in data_splits}

for d, name in zip((train, dev, test), data_splits):
    fasttext_data[name]['text1'] = np.array([text2vec(t, wv_from_text) for t in d['text1'].tolist()])
    fasttext_data[name]['text2'] = np.array([text2vec(t, wv_from_text) for t in d['text2'].tolist()])

In [44]:
fasttext_data['train']['text1'].shape

(20996, 300)

In [202]:
best_f1, best_thres = calculate_score(
    fasttext_data['dev']['text1'], 
    fasttext_data['dev']['text2'],
    dev['labels'].values
)

print(f'(DEV) F1 score = {best_f1}')

(DEV) F1 score = 0.5973247232472325


In [203]:
test_cos_dists = calculate_cosine_distance(fasttext_data['test']['text1'], fasttext_data['test']['text2'])
test_f1 = f1_score(test['labels'].values, (test_cos_dists < best_thres).astype(np.int32))
print(f'(TEST) F1 score = {test_f1}')

(TEST) F1 score = 0.6051258788841006


## How to handle texts?

Embeddings!

Word embeddings are one of the most popular representations of document vocabulary. They can capture the context of a word in a document, semantic and syntactic similarity, relations with other words, etc.

<img src="https://adriancolyer.files.wordpress.com/2016/04/word2vec-distributed-representation.png?w=656&zoom=2">
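Mechanically, an embedding layer is a lookup table: every token is mapped to an integer id, and the id selects a row of a matrix. The sketch below shows this in plain Python with hypothetical helpers. The id order (`<unk>`, `<pad>`, then alphabetical) and the random vectors are assumptions for illustration, not what torchtext or a trained model would produce.

```python
import random
from collections import Counter

PAD, UNK = "<pad>", "<unk>"


def build_word2idx(texts, min_freq=2):
    """Map tokens that occur at least min_freq times to integer ids."""
    counts = Counter(token for text in texts for token in text.split())
    kept = sorted(token for token, count in counts.items() if count >= min_freq)
    return {token: idx for idx, token in enumerate([UNK, PAD] + kept)}


def encode(text, word2idx, length):
    """Replace tokens by ids and pad the sequence to a fixed length."""
    ids = [word2idx.get(token, word2idx[UNK]) for token in text.split()]
    return ids + [word2idx[PAD]] * (length - len(ids))


texts = ["how do i learn python", "how do i learn to cook"]
word2idx = build_word2idx(texts)

# the embedding table: one (here random) vector per id
rng = random.Random(0)
table = [[rng.gauss(0, 1) for _ in range(3)] for _ in word2idx]
table[word2idx[PAD]] = [0.0, 0.0, 0.0]  # padding carries no information

ids = encode("how do i cook", word2idx, length=6)
embedded = [table[i] for i in ids]

print(ids)  # "cook" is rare, so it maps to the <unk> id
```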

In [None]:
import torchtext
import torch
import torch.nn as nn

from collections import defaultdict

In [None]:
torch.__version__

In [None]:
torchtext.__version__

In [None]:
if not os.path.exists('data/paraphrase'):
    os.makedirs('data/paraphrase')

train.to_csv('data/paraphrase/train.csv', index=False)
dev.to_csv('data/paraphrase/dev.csv', index=False)
test.to_csv('data/paraphrase/test.csv', index=False)

In [None]:
device_name = 'cuda' if torch.cuda.is_available() else 'cpu'
device = torch.device(device_name)

In [None]:
class ParaphraseDataset:

    def __init__(self, path, is_classification=False, min_freq=2, batch_sizes=(64, 64, 64), device=device):
        self.path = path
        self.is_classification = is_classification
        self.min_freq = min_freq
        self.batch_sizes = batch_sizes
        self.device = device

        self.text_field1 = None
        self.text_field2 = None
        self.labels_field = None
        self.train_dataset, self.dev_dataset, self.test_dataset = None, None, None
        
        self.word2idx = None
        self.idx2word = None
        
        self.build_dataset()
        self.build_vocab()

    def build_dataset(self):

        self.text_field1 = torchtext.data.Field(
            sequential=True,
            batch_first=True,
            lower=True,
            preprocessing=None
        )

        self.labels_field = torchtext.data.Field(
            sequential=False,
            use_vocab=False,
            is_target=True,
            batch_first=True,
            dtype=torch.float32
        )

        if not self.is_classification:

            self.text_field2 = torchtext.data.Field(
                sequential=True,
                batch_first=True,
                lower=True,
                preprocessing=None
            )

            fields = [
                ('text1', self.text_field1),
                ('text2', self.text_field2),
                ('labels', self.labels_field)
            ]
        else:
            fields = [
                ('text1', self.text_field1),
                ('labels', self.labels_field)
            ]

        self.train_dataset, self.dev_dataset, self.test_dataset = torchtext.data.TabularDataset.splits(
            path=self.path,
            root='.',
            train='train.csv',
            validation='dev.csv',
            test='test.csv',
            format='csv',
            fields=fields,
            skip_header=True
        )

    def build_vocab(self):
        self.text_field1.build_vocab(self.train_dataset, min_freq=self.min_freq)
        
        if not self.is_classification:
            self.text_field2.build_vocab(self.train_dataset, min_freq=self.min_freq)
            
            self.word2idx = defaultdict(torchtext.vocab._default_unk_index)
            self.word2idx.update(dict(self.text_field1.vocab.stoi))
            
            # add the words that occur only in the second field
            for word in self.text_field2.vocab.stoi:
                if word not in self.word2idx:
                    self.word2idx[word] = len(self.word2idx)
            
            self.text_field1.vocab.stoi = self.word2idx
            self.text_field2.vocab.stoi = self.word2idx
        else:
            self.word2idx = dict(self.text_field1.vocab.stoi)
        
        self.idx2word = {idx: word for word, idx in self.word2idx.items()}
        print(f'Vocabulary size = {len(self.word2idx)}')
    
    def create_iterators(self):
        train_iter, dev_iter, test_iter = torchtext.data.Iterator.splits(
            datasets=(self.train_dataset, self.dev_dataset, self.test_dataset),
            batch_sizes=self.batch_sizes,
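            # `shuffle` is a single flag applied to every split (a tuple would be truthy),
            # so keep the default: the training iterator is shuffled, dev/test are not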
            sort=False,
            device=self.device
        )
        
        return train_iter, dev_iter, test_iter

In [None]:
para_dataset = ParaphraseDataset(path='data/paraphrase/')
train_iter, dev_iter, test_iter = para_dataset.create_iterators()

In [None]:
# batch = next(iter(train_iter))

for batch in train_iter:
    print(batch)
    break

In [None]:
batch.text1.shape, batch.text2.shape, batch.labels.shape

## Neural baseline

<img src="https://i.ibb.co/D7R7kNH/raai-pizza.png">

In [None]:
class EmbeddingLayer(nn.Module):

    def __init__(self, emb_dim, ntokens=len(para_dataset.word2idx), 
                 padding_idx=para_dataset.word2idx['<pad>']):
        super().__init__()

        self.emb_dim = emb_dim
        self.ntokens = ntokens
        self.emb = nn.Embedding(
            num_embeddings=self.ntokens,
            embedding_dim=self.emb_dim, 
            padding_idx=padding_idx
        )

    def forward(self, ids):

        x = self.emb(ids)

        return x

In [None]:
embedder = EmbeddingLayer(emb_dim=64)

embedder.to(device)

In [None]:
batch.text1.shape, batch.text2.shape

In [None]:
embeddings1 = embedder(batch.text1)
embeddings2 = embedder(batch.text2)

embeddings1.shape, embeddings2.shape

<img src="https://i0.wp.com/mlexplained.com/wp-content/uploads/2018/05/%E3%82%B9%E3%82%AF%E3%83%AA%E3%83%BC%E3%83%B3%E3%82%B7%E3%83%A7%E3%83%83%E3%83%88-2018-05-10-13.29.52.png?w=366">

In [None]:
class MeanPoolingOverTime(nn.Module):

    def __init__(self, dim=1):
        super().__init__()
        self.dim = dim

    def forward(self, x):
        return torch.mean(x, dim=self.dim)


class AveragingNetwork(nn.Module):

    def __init__(self, emb_dim=64, hidden_dim=32, output_dim=16):
        super().__init__()
        self.emb_dim = emb_dim
        self.hidden_dim = hidden_dim
        self.output_dim = output_dim

        self.feed_forward = nn.Sequential(
            MeanPoolingOverTime(),
            nn.Linear(in_features=self.emb_dim, out_features=self.hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(in_features=hidden_dim, out_features=self.output_dim),
        )

    def forward(self, embeds):

        hidden = self.feed_forward(embeds)

        return hidden

In [None]:
body = AveragingNetwork()

body.to(device)

In [None]:
hidden_vectors1 = body(embeddings1)
hidden_vectors2 = body(embeddings2)

hidden_vectors1.shape, hidden_vectors2.shape

In [None]:
class SimpleHead(nn.Module):
    def __init__(self, output_dim):
        super().__init__()

        self.output_dim = output_dim
        self.dense = nn.Linear(in_features=self.output_dim * 2, out_features=1)

    def forward(self, x, y):
        concatenated = torch.cat((x, y), dim=1)
        z = self.dense(concatenated)

        return z

In [None]:
head = SimpleHead(output_dim=16)

head.to(device)

In [None]:
# logits
output = head(hidden_vectors1, hidden_vectors2)

output.shape

In [None]:
# putting all together

class ModelHandler(nn.Module):
    def __init__(self, embedding_encoder, body_encoder, head_encoder):
        super().__init__()

        self.embedding_encoder = embedding_encoder
        self.body_encoder = body_encoder
        self.head_encoder = head_encoder

    def forward(self, text1, text2):

        hidden1 = self.predict_hidden(text1, aggregate=False)
        hidden2 = self.predict_hidden(text2, aggregate=False)

        output = self.head_encoder(hidden1, hidden2)

        if len(hidden1.size()) > 2:
            hidden1 = torch.mean(hidden1, dim=1)
            hidden2 = torch.mean(hidden2, dim=1)

        return output, (hidden1, hidden2)

    def predict_hidden(self, text, aggregate=True):

        embeds = self.embedding_encoder(text)
        hidden = self.body_encoder(embeds)

        if aggregate and len(hidden.size()) > 2:
            hidden = torch.mean(hidden, dim=1)

        return hidden
    
    def predict_attention_scores(self, context, query):
        hidden1 = self.predict_hidden(context, aggregate=False)
        hidden2 = self.predict_hidden(query, aggregate=False)

        scores = self.head_encoder.get_scores(hidden1, hidden2)
        
        return scores

In [None]:
baseline_model = ModelHandler(
    embedding_encoder=embedder,
    body_encoder=body, 
    head_encoder=head
)

baseline_model.to(device)

In [None]:
output, (hidden1, hidden2) = baseline_model(batch.text1, batch.text2)

output.shape, hidden1.shape, hidden2.shape

## Training process

In [None]:
from tensorboardX import SummaryWriter
from sklearn.metrics import f1_score

<img src="https://cdn-images-1.medium.com/max/1600/1*UJxVqLnbSj42eRhasKeLOA.png">

In [None]:
def calculate_f1(y_true: torch.Tensor, y_prob: torch.Tensor, 
                 thres: float = None, average: str = 'binary') -> float:

    y_prob = y_prob.detach().cpu().numpy()
    y_true = y_true.cpu().numpy()

    if thres is None:
        score = max([
            f1_score(y_true, (y_prob > thres).astype(int), average=average) for thres in np.linspace(0, 1)
        ])
    else:
        if average != 'binary':
            preds = y_prob
        else:
            preds = (y_prob > thres).astype(int)
        score = float(f1_score(y_true, preds, average=average))

    return score


def save_checkpoint(state_dict: dict, path: str, epoch: int) -> None:
    torch.save(state_dict, f'{path}/model_{epoch}')

In [None]:
num_epochs = 5
save_freq = 30


def write_metrics(writer, step, values):
    for name, value in values.items():
        writer.add_scalar(name, value, global_step=step)


def train_one_epoch(model_path, model, optimizer, iterator, writer, epoch):
    model.train()
    for step, batch in enumerate(iterator, start=(epoch - 1) * len(iterator)):

        optimizer.zero_grad()

        logits, (text1_hidden, text2_hidden) = model(batch.text1, batch.text2)
        # squeeze only the last dimension so that a batch of size 1 keeps its shape
        loss = criterion(logits.squeeze(-1), batch.labels)
        loss.backward()
        optimizer.step()
        
        if step % save_freq == 0:
            f1 = calculate_f1(y_true=batch.labels, y_prob=torch.sigmoid(logits))
            write_metrics(writer, step, {'loss': loss.item(), 'f1': f1})
            
            print(f'[Train]  Epoch = {epoch}, Loss Value = {loss.item():.4f}, F1 score = {f1:.4f}')


def validate(model, iterator, writer=None, epoch=None, step=None):
    model.eval()  # switch off training-only behaviour such as dropout
    with torch.no_grad():
        loss_history = list()
        f1_history = list()
        for batch in iterator:
            logits, (text1_hidden, text2_hidden) = model(batch.text1, batch.text2)
            loss = criterion(logits.squeeze(-1), batch.labels)
            loss_history.append(loss.item())

            f1 = calculate_f1(y_true=batch.labels, y_prob=torch.sigmoid(logits))
            f1_history.append(f1)

        # metrics are averaged over batches
        loss = np.mean(loss_history)
        f1 = np.mean(f1_history)

        if writer is not None:
            write_metrics(writer, step, {'loss': loss, 'f1': f1})
            print(f'>>>>>>> [Eval]  Epoch = {epoch}, Loss Value = {loss:.4f}, F1 score = {f1:.4f}')
        else:
            return f1


def train_evaluate(model_path,
                   model,
                   optimizer, 
                   train_iter,
                   dev_iter=None,
                   num_epochs=num_epochs, 
                   save_freq=save_freq):
    
    train_writer = SummaryWriter(model_path)
    dev_writer = SummaryWriter(os.path.join(model_path, 'eval'))
    
    if dev_iter is not None:
        validate(model, dev_iter, dev_writer, epoch=0, step=0)

    for epoch in range(1, num_epochs + 1):
        
        train_one_epoch(model_path, model, optimizer, train_iter, train_writer, epoch)
        
        if dev_iter is not None:
            validate(model, dev_iter, dev_writer, epoch, step=(epoch * len(train_iter)))

        # save_checkpoint(model.state_dict(), str(experiment_dir), epoch)
        
    train_writer.close()
    dev_writer.close()

$$
\mathcal{L}=-\frac{1}{N} \sum_{i=1}^{N}\left[y_{i} \log p_{i}+\left(1-y_{i}\right) \log \left(1-p_{i}\right)\right], \qquad p_{i}=\sigma\left(z_{i}\right)
$$
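The loss can be reproduced by hand. The sketch below is plain Python with hypothetical function names. It averages over the batch, which is the default reduction of `nn.BCEWithLogitsLoss`, and compares the textbook form on probabilities with the numerically stable form that works directly on logits.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bce_from_probabilities(logits, labels):
    """Textbook form: apply the sigmoid first, then take the logs."""
    total = 0.0
    for z, y in zip(logits, labels):
        p = sigmoid(z)
        total += y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return -total / len(logits)


def bce_with_logits(logits, labels):
    """Same value, rewritten so that no log of a tiny number is taken."""
    total = 0.0
    for z, y in zip(logits, labels):
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


logits = [2.0, -1.0, 0.5]
labels = [1.0, 0.0, 1.0]

print(bce_from_probabilities(logits, labels))
print(bce_with_logits(logits, labels))
```

Working on logits is the reason the model heads in this notebook return raw scores and the sigmoid is applied only when computing F1.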

In [None]:
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(baseline_model.parameters(), lr=0.003)

train_evaluate(
    model_path='experiments/model_baseline', 
    model=baseline_model, 
    optimizer=optimizer, 
    train_iter=train_iter, 
    dev_iter=dev_iter
)

## More complex approach

<img src="https://i.ibb.co/FXDqQbT/Screenshot-2019-07-05-at-09-51-08.png">

In [None]:
class SneakyHead(nn.Module):
    def __init__(self, output_dim):
        super().__init__()

        self.output_dim = output_dim
        self.dense = nn.Linear(in_features=self.output_dim * 4, out_features=1)

    def forward(self, x, y):
        emb_mul = torch.mul(x, y)
        emb_abs = torch.abs(x - y)
        concatenated = torch.cat([x, y, emb_abs, emb_mul], dim=1)
        z = self.dense(concatenated)

        return z

In [None]:
emb_dim = 128
hidden_dim = 64
output_dim = 32


sneaky_model = ModelHandler(
    embedding_encoder=EmbeddingLayer(emb_dim),
    body_encoder=AveragingNetwork(emb_dim, hidden_dim, output_dim), 
    head_encoder=SneakyHead(output_dim)
)

sneaky_model.to(device)

In [None]:
optimizer = torch.optim.Adam(sneaky_model.parameters(), lr=0.003)

train_evaluate(
    model_path='experiments/sneaky_model', 
    model=sneaky_model, 
    optimizer=optimizer, 
    train_iter=train_iter, 
    dev_iter=dev_iter
)

## Let's try attention!

But what is attention?

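Attention is a vector of non-negative weights that sum to one, usually obtained by applying softmax to the scores produced by a dense layer. The weights determine how much each position of a sequence contributes to a weighted average of its vectors.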

<img src="http://jalammar.github.io/images/t/transformer_self-attention_visualization.png">
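As a minimal illustration, the sketch below computes attention weights for a single query vector over a short sequence, using a plain dot product as the score. The function names and the toy vectors are hypothetical, and the scoring function is deliberately the simplest possible one.

```python
import math


def softmax(scores):
    # subtract the maximum for numerical stability
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Weights of one query over a sequence, plus the weighted average."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    summary = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, summary


keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, summary = attend(query=[2.0, 0.0], keys=keys, values=keys)

print([round(w, 3) for w in weights])
print([round(s, 3) for s in summary])
```

Positions whose vectors align with the query get most of the weight, and the second position, which is orthogonal to it, gets the least.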

### Bilinear Attention

$$
\begin{aligned} s_{j}^{t} &=h_{j}^{q T} W_{b} h_{t}^{p} \\ a_{i}^{t} &=\exp \left(s_{i}^{t}\right) / \sum_{j=1}^{N} \exp \left(s_{j}^{t}\right) \\ q_{t}^{b} &=\sum_{i=1}^{N} a_{i}^{t} h_{i}^{q} \end{aligned}
$$
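The three lines of the bilinear formula map directly onto code. The sketch below follows them for every position `t` of one sequence `h_p` attending over another sequence `h_q`. It is plain Python with hypothetical names, and `W` is set to the identity matrix only to keep the numbers easy to follow.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def matvec(W, v):
    return [dot(row, v) for row in W]


def softmax(scores):
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def bilinear_attention(h_q, h_p, W):
    """For each h_p[t] return (weights over h_q, attended vector q_t)."""
    dim = len(h_q[0])
    outputs = []
    for h_t in h_p:
        Wh_t = matvec(W, h_t)
        scores = [dot(h_j, Wh_t) for h_j in h_q]  # s_j^t = h_j^T W h_t
        weights = softmax(scores)                 # a_i^t
        attended = [                              # q_t = sum_i a_i^t h_i
            sum(a * h[i] for a, h in zip(weights, h_q)) for i in range(dim)
        ]
        outputs.append((weights, attended))
    return outputs


h_q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # N = 3 states
h_p = [[1.0, 0.0], [0.0, 2.0]]              # 2 states
W = [[1.0, 0.0], [0.0, 1.0]]                # identity: plain dot product

result = bilinear_attention(h_q, h_p, W)
for weights, attended in result:
    print([round(w, 3) for w in weights], [round(x, 3) for x in attended])
```

The output has one attended vector per position of `h_p`, each with the dimensionality of the states in `h_q`.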

### Concat Attention

$$
\begin{aligned} s_{j}^{t} &=v_{c}^{T} \tanh \left(W_{c}^{1} h_{j}^{q}+W_{c}^{2} h_{t}^{p}\right) \\ a_{i}^{t} &=\exp \left(s_{i}^{t}\right) / \sum_{j=1}^{N} \exp \left(s_{j}^{t}\right) \\ q_{t}^{c} &=\sum_{i=1}^{N} a_{i}^{t} h_{i}^{q} \end{aligned}
$$

### Dot Attention

$$
\begin{aligned} s_{j}^{t} &=v_{d}^{T} \tanh \left(W_{d}\left(h_{j}^{q} \odot h_{t}^{p}\right)\right) \\ a_{i}^{t} &=\exp \left(s_{i}^{t}\right) / \sum_{j=1}^{N} \exp \left(s_{j}^{t}\right) \\ q_{t}^{d} &=\sum_{i=1}^{N} a_{i}^{t} h_{i}^{q} \end{aligned}
$$

### Minus Attention

$$
\begin{aligned} s_{j}^{t} &=v_{m}^{T} \tanh \left(W_{m}\left(h_{j}^{q}-h_{t}^{p}\right)\right) \\ a_{i}^{t} &=\exp \left(s_{i}^{t}\right) / \sum_{j=1}^{N} \exp \left(s_{j}^{t}\right) \\ q_{t}^{m} &=\sum_{i=1}^{N} a_{i}^{t} h_{i}^{q} \end{aligned}
$$

In [None]:
class BilinearAttention(nn.Module):
    # x^T W y

    def __init__(self, emb_dim):
        super().__init__()

        self.emb_dim = emb_dim
        self.W = nn.Linear(self.emb_dim, self.emb_dim, bias=False)

    def forward(self, context, query):
        scores = self.get_scores(context, query)
        output = torch.bmm(scores, context)

        return output
    
    def get_scores(self, context, query):
        contextW = self.W(context)
        scores = torch.bmm(contextW, query.transpose(1, 2))
        scores = torch.softmax(scores, dim=1).transpose(2, 1)
        
        return scores
    
    
class MinusAttention(nn.Module):
    # v^T tanh(W(x - y))

    def __init__(self, emb_dim):
        super().__init__()

        self.emb_dim = emb_dim
        self.W = nn.Linear(self.emb_dim, self.emb_dim, bias=False)
        self.v = nn.Linear(self.emb_dim, 1, bias=False)

    def forward(self, context, query):
        scores = self.get_scores(context, query)
        output = torch.bmm(scores.transpose(2, 1), context)

        return output
    
    def get_scores(self, context, query):
        batch_size, m, _ = context.size()
        k = query.size(1)

        context_ = context.repeat(1, k, 1)
        query_ = query.repeat_interleave(m, dim=1)
        minus = torch.sub(context_, query_)

        Wminus = self.W(minus)
        Wminus_tanh = torch.tanh(Wminus)

        scores = self.v(Wminus_tanh)
        # pairs are laid out query-major: (query j, context t) sits at index j * m + t
        scores = scores.reshape(batch_size, k, m).transpose(1, 2)
        scores = torch.softmax(scores, dim=1)
        
        return scores


class ConcatAttention(nn.Module):
    # v^T tanh(W_1 x + W_2 y)

    def __init__(self, emb_dim):
        super().__init__()

        self.emb_dim = emb_dim
        self.W1 = nn.Linear(self.emb_dim, self.emb_dim, bias=False)
        self.W2 = nn.Linear(self.emb_dim, self.emb_dim, bias=False)
        self.v = nn.Linear(self.emb_dim, 1, bias=False)

    def forward(self, context, query):
        scores = self.get_scores(context, query)
        output = torch.bmm(scores.transpose(2, 1), context)

        return output
    
    def get_scores(self, context, query):
        batch_size, m, _ = context.size()
        k = query.size(1)

        context_ = context.repeat(1, k, 1)
        query_ = query.repeat_interleave(m, dim=1)

        W1context = self.W1(context_)
        W2query = self.W2(query_)
        Wsum_tanh = torch.tanh(W1context + W2query)

        scores = self.v(Wsum_tanh)
        # pairs are laid out query-major: (query j, context t) sits at index j * m + t
        scores = scores.reshape(batch_size, k, m).transpose(1, 2)
        scores = torch.softmax(scores, dim=1)
        
        return scores


class DotAttention(nn.Module):
    # v^T tanh(W (x * y))

    def __init__(self, emb_dim):
        super().__init__()

        self.emb_dim = emb_dim
        self.W = nn.Linear(self.emb_dim, self.emb_dim, bias=False)
        self.v = nn.Linear(self.emb_dim, 1, bias=False)

    def forward(self, context, query):
        scores = self.get_scores(context, query)
        output = torch.bmm(scores.transpose(2, 1), context)

        return output
    
    def get_scores(self, context, query):
        batch_size, m, _ = context.size()
        k = query.size(1)

        context_ = context.repeat(1, k, 1)
        query_ = query.repeat_interleave(m, dim=1)
        dot = torch.mul(context_, query_)

        Wdot = self.W(dot)
        Wdot_tanh = torch.tanh(Wdot)

        scores = self.v(Wdot_tanh)
        # pairs are laid out query-major: (query j, context t) sits at index j * m + t
        scores = scores.reshape(batch_size, k, m).transpose(1, 2)
        scores = torch.softmax(scores, dim=1)
        
        return scores


In [None]:
emb_dim = 64

att_mechanism = BilinearAttention(emb_dim=emb_dim)
att_mechanism.to(device)

In [None]:
batch_size = 32

x = torch.rand((batch_size, 3, emb_dim), device=device)
y = torch.rand((batch_size, 5, emb_dim), device=device)

# [batch_size, query_len, emb_dim]
att_mechanism(context=x, query=y).shape

In [None]:
att_mechanism(context=y, query=x).shape

<img src="https://miro.medium.com/max/1838/1*8nFrwolzTYtUWSaziiJGkg.png">

In [None]:
class SimpleLSTM(nn.Module):

    def __init__(self, emb_dim, hidden_dim, num_layers, bidirectional=False, aggregate=False):
        super().__init__()
        self.emb_dim = emb_dim
        self.hidden_dim = hidden_dim
        self.bidirectional = bidirectional
        self.aggregate = aggregate
        self.output_dim = self.hidden_dim * 2 if self.bidirectional else self.hidden_dim

        self.rnn = nn.LSTM(
            self.emb_dim,
            self.hidden_dim,
            num_layers=num_layers,
            bidirectional=self.bidirectional,
            batch_first=True
        )

    def forward(self, embeds):

        output, _ = self.rnn(embeds)

        if self.aggregate:
            output = torch.mean(output, dim=1)

        return output


In [None]:
class AttentionHead(nn.Module):
    def __init__(self, output_dim, attention_mechanism):
        super().__init__()

        self.output_dim = output_dim
        self.attention_mechanism = attention_mechanism(self.output_dim)
        self.dense = nn.Linear(in_features=self.output_dim * 8, out_features=1)

    def forward(self, x, y):
        new_x = self.attention_mechanism(context=y, query=x)
        new_y = self.attention_mechanism(context=x, query=y)
        
        x = torch.cat((new_x, x), dim=-1)
        y = torch.cat((new_y, y), dim=-1)

        x = torch.mean(x, dim=1)
        y = torch.mean(y, dim=1)
        
        emb_mul = torch.mul(x, y)
        emb_abs = torch.abs(x - y)
        
        concatenated = torch.cat([x, y, emb_mul, emb_abs], dim=1)
        z = self.dense(concatenated)

        return z
    
    def get_scores(self, x, y):
        scores = self.attention_mechanism.get_scores(x, y)
        return scores

In [None]:
emb_dim = 128

attention_model_baseline = ModelHandler(
    embedding_encoder=EmbeddingLayer(emb_dim=emb_dim), 
    body_encoder=nn.Identity(), 
    head_encoder=AttentionHead(
        output_dim=emb_dim, 
        attention_mechanism=BilinearAttention
    )
)

attention_model_baseline.to(device)

In [None]:
optimizer = torch.optim.Adam(attention_model_baseline.parameters(), lr=0.003)

train_evaluate(
    model_path='experiments/attention_model_baseline', 
    model=attention_model_baseline, 
    optimizer=optimizer, 
    train_iter=train_iter, 
    dev_iter=dev_iter
)

In [None]:
emb_dim = 128
hidden_dim = 64
num_layers = 1


attention_model = ModelHandler(
    embedding_encoder=EmbeddingLayer(emb_dim=emb_dim), 
    body_encoder=SimpleLSTM(emb_dim=emb_dim, hidden_dim=hidden_dim, num_layers=num_layers), 
    head_encoder=AttentionHead(
        output_dim=hidden_dim, 
        attention_mechanism=BilinearAttention
    )
)

attention_model.to(device)

In [None]:
optimizer = torch.optim.Adam(attention_model.parameters(), lr=0.003)

train_evaluate(
    model_path='experiments/attention_model', 
    model=attention_model, 
    optimizer=optimizer, 
    train_iter=train_iter, 
    dev_iter=dev_iter
)

Let's run

> tensorboard --logdir experiments

## Visualization

In [None]:
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler

In [None]:
def text_to_tensor(text, dataset_handler=para_dataset):
    
    if isinstance(text, str):
        text = [text]
    
    field = ('x', dataset_handler.train_dataset.fields['text1'])
    examples = [torchtext.data.Example.fromlist([t], fields=[field]) for t in text]
    dataset = torchtext.data.Dataset(examples, fields=[field])

    iterator = torchtext.data.Iterator(
        dataset=dataset,
        batch_size=len(text),
        shuffle=False,
        device=device
    )
    
    return next(iter(iterator)).x


def plot_attention(query, context, att_weights, scale=True):
    tokens_a = query.split()
    tokens_b = context.split()
    
    assert len(tokens_a) == att_weights.shape[0]
    assert len(tokens_b) == att_weights.shape[1]
    
    if scale:
        # min-max scale each row; constant rows become zeros instead of NaN
        mins = att_weights.min(axis=1)
        maxes = att_weights.max(axis=1)
        spans = np.maximum(maxes - mins, 1e-12)
        att_weights = (att_weights - mins.reshape(-1, 1)) / spans.reshape(-1, 1)

    fig, ax = plt.subplots(figsize=(5, 5))
    ax.imshow(att_weights, cmap='gray')
    ax.set_xticks(np.arange(att_weights.shape[1]))
    ax.set_yticks(np.arange(att_weights.shape[0]))

    ax.set_xticklabels([word for word in tokens_b])
    ax.set_yticklabels([word for word in tokens_a])

    ax.tick_params(labelsize=16)
    ax.tick_params(axis='x', labelrotation=90)

    plt.show()
    
    
def get_attention_scores(model, textA, textB):
    with torch.no_grad():
        attn_scores = model.predict_attention_scores(
            text_to_tensor(textA), 
            text_to_tensor(textB)
        ).squeeze(0)

    attn_scores = attn_scores.detach().cpu().numpy()
    
    return attn_scores
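
The `scale=True` branch of `plot_attention` rescales every row of the attention matrix to [0, 1], so the strongest context token for each query token is always drawn in white. A minimal standalone sketch of that row-wise min-max scaling is shown here; the helper name `scale_rows`, the toy weights, and the guard for constant rows are assumptions added for illustration.

```python
import numpy as np


def scale_rows(att_weights, eps=1e-12):
    """Min-max scale each row of a 2-D array to the [0, 1] range."""
    att_weights = np.asarray(att_weights, dtype=float)
    mins = att_weights.min(axis=1, keepdims=True)
    maxes = att_weights.max(axis=1, keepdims=True)
    # Assumption: a constant row (max == min) maps to zeros instead of NaN.
    span = np.maximum(maxes - mins, eps)
    return (att_weights - mins) / span


weights = np.array([[0.1, 0.3, 0.6],
                    [0.2, 0.2, 0.2]])
scaled = scale_rows(weights)
print(scaled)
```

Scaling per row makes short and long sentences comparable in the plot, at the cost of hiding how peaked each row originally was.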

In [None]:
ex = 2

textA = train.iloc[ex].text1
textB = train.iloc[ex].text2
lab = train.iloc[ex].labels

print(f'Label = {lab}')
print(f'Text1 = {textA}')
print(f'Text2 = {textB}')

attn_scores = get_attention_scores(attention_model, textA, textB)

plot_attention(textB, textA, attn_scores, scale=False)

# Paraphrase retrieval

In practice you usually need to solve a different task: given a text, find its paraphrase among the texts you already know.
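
One way to picture this retrieval task: embed every known question once, embed the incoming text, and return the closest known question if it is similar enough. The sketch below illustrates the idea with toy vectors and cosine similarity; the function name `retrieve`, the toy embeddings, and the 0.8 threshold are illustrative assumptions, not values from this tutorial.

```python
import numpy as np


def retrieve(query_vec, corpus_vecs, threshold=0.8):
    """Return (index, similarity) of the closest corpus vector,
    or (None, similarity) when nothing is similar enough."""
    # Normalise so that the dot product equals cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    c = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    sims = c @ q
    best = int(np.argmax(sims))
    if sims[best] < threshold:
        return None, float(sims[best])
    return best, float(sims[best])


corpus = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
hit = retrieve(np.array([0.9, 0.1]), corpus)    # close to the first vector
miss = retrieve(np.array([-1.0, 0.0]), corpus)  # nothing similar enough
print(hit, miss)
```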

In [None]:
quora = pd.read_csv('data/quora_modified.csv')

In [None]:
quora.shape

In [None]:
min_lab, max_lab = quora['labels'].min(), quora['labels'].max()

min_lab, max_lab

In [None]:
def convert_neg_labels(lab):
    if lab == -1:
        lab = max_lab + 1
    return lab

quora['labels'] = quora['labels'].map(convert_neg_labels)

In [None]:
quora['text'] = quora['text'].apply(clean_string)

In [None]:
quora.head()

In [None]:
quora.loc[quora['labels'] == 93].head()

In [None]:
train, intermediate = train_test_split(quora, stratify=quora['labels'], test_size=0.3, random_state=24)
dev, test = train_test_split(intermediate, stratify=intermediate['labels'], test_size=0.5, random_state=24)

In [None]:
if not os.path.exists('data/classification'):
    os.makedirs('data/classification')

train.to_csv('data/classification/train.csv', index=False)
dev.to_csv('data/classification/dev.csv', index=False)
test.to_csv('data/classification/test.csv', index=False)

In [None]:
class_dataset = ParaphraseDataset('data/classification/', is_classification=True, batch_sizes=(64, 64, 1))
train_iter, dev_iter, test_iter = class_dataset.create_iterators()

In [None]:
for batch in train_iter:
    print(batch)
    break

In [None]:
class ClassificationModelHandler(nn.Module):
    def __init__(self, embedding_encoder, body_encoder, head_encoder):
        super().__init__()

        self.embedding_encoder = embedding_encoder
        self.body_encoder = body_encoder
        self.head_encoder = head_encoder

    def forward(self, text):
        
        embeds = self.embedding_encoder(text)
        hidden = self.body_encoder(embeds)
        output = self.head_encoder(hidden)
        return output

In [None]:
emb_dim = 128
hidden_dim = 64
output_dim = 32

class_model = ClassificationModelHandler(
    embedding_encoder=EmbeddingLayer(emb_dim),
    body_encoder=AveragingNetwork(emb_dim, hidden_dim, output_dim), 
    head_encoder=nn.Linear(in_features=output_dim, out_features=(max_lab + 2))
)

class_model.to(device)

In [None]:
output = class_model(batch.text1)

output.shape

In [None]:
# we need to modify our functions a little

def train_one_epoch_class(model_path, model, optimizer, iterator, writer, epoch):
    model.train()
    for step, batch in enumerate(iterator, start=(epoch - 1) * len(iterator)):

        optimizer.zero_grad()

        logits = model(batch.text1)
        loss = criterion(logits, batch.labels.long())
        probs = torch.softmax(logits, dim=1)
        loss.backward()
        optimizer.step()
        
        if step % save_freq == 0:
            f1 = calculate_f1(y_true=batch.labels, y_prob=torch.argmax(probs, dim=1), average='macro', thres=1)
            write_metrics(writer, step, {'loss': loss.item(), 'f1': f1})
            
            print(f'[Train]  Epoch = {epoch}, Loss Value = {loss.item():.4f}, F1 score = {f1:.4f}')


def validate_class(model, iterator, writer, epoch, step):
    model.eval()  # disable dropout and similar layers during evaluation
    with torch.no_grad():
        loss_history = list()
        f1_history = list()
        for batch in iterator:
            logits = model(batch.text1)
            probs = torch.softmax(logits, dim=1)
            y_pred = torch.argmax(probs, dim=1)
            f1 = calculate_f1(y_true=batch.labels, y_prob=y_pred, average='macro', thres=1)
            f1_history.append(f1)
    
            loss = criterion(logits, batch.labels.long())
            loss_history.append(loss.item())

        loss = np.mean(loss_history)
        f1 = np.mean(f1_history)
        write_metrics(writer, step, {'loss': loss, 'f1': f1})
        
        print(f'>>>>>>> [Test]  Epoch = {epoch}, Loss Value = {loss:.4f}, F1 score = {f1:.4f}')


def train_evaluate_class(model_path,
                   model,
                   optimizer, 
                   train_iter,
                   dev_iter=None,
                   num_epochs=num_epochs, 
                   save_freq=save_freq):
    
    train_writer = SummaryWriter(model_path)
    dev_writer = SummaryWriter(os.path.join(model_path, 'eval'))
    
    if dev_iter is not None:
        validate_class(model, dev_iter, dev_writer, epoch=0, step=0)

    for epoch in range(1, num_epochs + 1):
        
        train_one_epoch_class(model_path, model, optimizer, train_iter, train_writer, epoch)
        
        if dev_iter is not None:
            validate_class(model, dev_iter, dev_writer, epoch, step=(epoch * len(train_iter)))

        # save_checkpoint(model.state_dict(), str(experiment_dir), epoch)
        
    train_writer.close()
    dev_writer.close()

In [None]:
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(class_model.parameters(), lr=0.003)

train_evaluate_class(
    model_path='experiments/classification', 
    model=class_model, 
    optimizer=optimizer, 
    train_iter=train_iter, 
    dev_iter=dev_iter,
    num_epochs=15
)

### Classification dataset -> Paraphrase dataset

In [None]:
from tqdm import tqdm

In [None]:
# brute force: score every test question against every training question
for test_batch in tqdm(test_iter):
    probabilities = []
    labels = []
    for train_batch in train_iter:
        mask = (train_batch.labels != max_lab + 1)
        train_text = train_batch.text1[mask]
        train_labels = train_batch.labels[mask]

        test_text = torch.repeat_interleave(test_batch.text1, train_text.shape[0], dim=0)

        logits = attention_model(test_text, train_text)[0]
        probs = torch.sigmoid(logits)

        probabilities.append(probs.detach().cpu().numpy())
        labels.append(train_labels.cpu().numpy())
    
    probabilities = np.array(probabilities)
    labels = np.array(labels)

## Too long!
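
Scoring every test question against every training question needs one pair-model evaluation per (test, train) pair. Encoding each question once and re-scoring only a handful of nearest neighbours is far cheaper. The back-of-the-envelope count below uses made-up dataset sizes (`n_train`, `n_test`) purely for illustration; `k = 20` is the neighbour count used in the kNN cells of this notebook.

```python
def brute_force_evaluations(n_train, n_test):
    # every test question is paired with every training question
    return n_train * n_test


def retrieve_then_rerank_evaluations(n_train, n_test, k):
    # encode each question once, then re-score k candidates per test question
    return n_train + n_test + k * n_test


n_train, n_test, k = 70_000, 15_000, 20  # hypothetical sizes
brute = brute_force_evaluations(n_train, n_test)
cheap = retrieve_then_rerank_evaluations(n_train, n_test, k)
print(brute, cheap, brute / cheap)
```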

In [None]:
class_dataset = ParaphraseDataset('data/classification/', is_classification=True, batch_sizes=(1, 1, 1))
train_iter, dev_iter, test_iter = class_dataset.create_iterators()

x_train_texts = []
x_train = []
y_train = []

for train_batch in tqdm(train_iter):
    mask = (train_batch.labels != max_lab + 1)
    if mask.sum() > 0:
        train_text = train_batch.text1  # [mask]
        train_labels = train_batch.labels  # [mask]

        vectors = attention_model.predict_hidden(train_text, aggregate=True)
        x_train_texts.append(train_text.cpu().numpy())
        x_train.append(vectors.detach().cpu().numpy())
        y_train.append(train_labels.cpu().numpy())

x_train = np.vstack(x_train)
y_train = np.hstack(y_train)

x_test_texts = []
x_test = []
y_test = []

for test_batch in tqdm(test_iter):
    test_text = test_batch.text1
    test_labels = test_batch.labels
    
    vectors = attention_model.predict_hidden(test_text, aggregate=True)
    x_test_texts.append(test_text.cpu().numpy())
    x_test.append(vectors.detach().cpu().numpy())
    y_test.append(test_labels.cpu().numpy())
    
x_test = np.vstack(x_test)
y_test = np.hstack(y_test)

## Zero-Shot = Approx kNN
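
The cells in this section index the training vectors with faiss and look up the `k` nearest neighbours of each test vector. For intuition, here is a plain NumPy sketch of such a lookup by exhaustive squared-L2 search; `knn_l2` and the toy vectors are assumptions made for illustration, and a real run should use the faiss index.

```python
import numpy as np


def knn_l2(queries, database, k):
    """Indices and squared L2 distances of the k nearest database rows."""
    # pairwise squared distances, shape (n_queries, n_database)
    diffs = queries[:, None, :] - database[None, :, :]
    dists = (diffs ** 2).sum(axis=2)
    idx = np.argsort(dists, axis=1)[:, :k]
    return idx, np.take_along_axis(dists, idx, axis=1)


database = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]], dtype=np.float32)
queries = np.array([[0.9, 0.1]], dtype=np.float32)
indexes, distances = knn_l2(queries, database, k=2)
print(indexes, distances)
```

The neighbours found this way are only candidates: the pair model then re-scores them, and the label of the best-scoring candidate becomes the prediction.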

In [None]:
import faiss

In [None]:
k = 20

emb_size = x_train.shape[1]
faiss_index = faiss.IndexFlat(emb_size)

faiss_index.verbose = True
faiss_index.add(x_train)

predicted_probs = []
predicted_labels = []

for i in tqdm(range(y_test.shape[0])):
    _, indexes = faiss_index.search(x_test[i].reshape(1, -1), k=k)
    train_texts = [x_train_texts[q] for q in indexes[0]]
    train_labels = y_train[indexes[0]]
    
    test_text = x_test_texts[i]
    test_label = y_test[i]
    
    test_probs = []
    for j in range(k):
        train_text = train_texts[j]
        logits, _ = attention_model(
            torch.from_numpy(train_text).to(device),
            torch.from_numpy(test_text).to(device)
        )
        probs = torch.sigmoid(logits).detach().cpu().numpy()
        test_probs.append(probs[0][0])
        
    test_probs = np.array(test_probs)
    max_prob_idx = np.argmax(test_probs)
    max_prob = test_probs[max_prob_idx]
    pred_label = train_labels[max_prob_idx]
    
    predicted_probs.append(max_prob)
    predicted_labels.append(pred_label)
    
predicted_probs = np.array(predicted_probs)
predicted_labels = np.array(predicted_labels)

best_thres = None
best_f1 = 0

for thres in np.linspace(0, 1):

    y_pred = predicted_labels.copy()
    # low-confidence matches fall back to the "no paraphrase" class
    y_pred[predicted_probs < thres] = max_lab + 1

    f1 = f1_score(y_true=y_test, y_pred=y_pred, average='macro')
    if f1 > best_f1:
        best_f1 = f1
        best_thres = thres

print(f'F1 score = {best_f1}')
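
The threshold sweep turns low-confidence matches into the "no paraphrase" class and keeps the threshold that scores best. The toy sketch below shows the mechanics on made-up arrays; to stay dependency-free it scores each threshold with plain accuracy instead of the macro F1 used in this notebook, and `NO_MATCH` stands in for `max_lab + 1`.

```python
import numpy as np

NO_MATCH = 99  # hypothetical id of the "no paraphrase" class

y_true = np.array([3, 7, NO_MATCH, 5, NO_MATCH])
neighbour_labels = np.array([3, 7, 4, 5, 2])  # label of the best neighbour
neighbour_probs = np.array([0.9, 0.8, 0.35, 0.75, 0.45])  # pair-model confidence

best_thres, best_acc = None, -1.0
for thres in np.linspace(0, 1, 11):
    y_pred = neighbour_labels.copy()
    # weak matches are treated as "no paraphrase"
    y_pred[neighbour_probs < thres] = NO_MATCH
    acc = float((y_pred == y_true).mean())
    if acc > best_acc:
        best_thres, best_acc = float(thres), acc

print(best_thres, best_acc)
```

Note that picking the threshold on the same data used for reporting gives an optimistic score; a held-out dev split is the safer choice.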

## Collect new dataset

In [None]:
def jaccard(text1, text2):
    vocab1 = set(text1.split())
    vocab2 = set(text2.split())
    int_size = len(vocab1 & vocab2)
    un_size = len(vocab1 | vocab2)
    if un_size > 0:
        return int_size / un_size
    else:
        return 0
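
As a quick sanity check of the Jaccard similarity (shared distinct words divided by all distinct words), the self-contained snippet below restates the function compactly and evaluates it on two toy questions, which are made up for illustration.

```python
def jaccard(text1, text2):
    vocab1 = set(text1.split())
    vocab2 = set(text2.split())
    union = vocab1 | vocab2
    # two empty texts share nothing, so their similarity is defined as 0
    return len(vocab1 & vocab2) / len(union) if union else 0


a = 'how do i learn python'
b = 'how can i learn python fast'
score = jaccard(a, b)  # 4 shared words out of 7 distinct words
print(score)
```

When building negative pairs, a band such as `0.01 < jaccard < 0.9` keeps pairs that share some vocabulary (so they are not trivially different) while skipping near-identical texts.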

In [None]:
# download train_new from here: https://yadi.sk/d/ProgN30MTkkEFQ
# or uncomment code below

# train_new = []
# num_pos = 5
# num_neg = 5

# for y in tqdm(train['labels'].unique()):
#     pos = train[train['labels'] == y]
#     neg = train[train['labels'] != y]
    
#     for _ in range(num_pos):
#         chosen_pos = np.random.choice(pos['text'].tolist(), 2, replace=False).tolist()
#         train_new.append(chosen_pos + [1])
        
#     for t in np.random.permutation(pos['text'].tolist())[:num_neg]:
#         for tn in np.random.permutation(neg['text'].tolist()):
#             jacc = jaccard(t, tn)
#             if (jacc > 0.01) and (jacc < 0.9):
#                 train_new.append([t, tn, 0])
#                 break
        
# train_new = pd.DataFrame(train_new, columns=['text1', 'text2', 'labels']).drop_duplicates()
# train_new = train_new.reset_index(drop=True)

In [None]:
# download train_new from here: https://yadi.sk/d/ProgN30MTkkEFQ
# or uncomment code above

train_new = pd.read_csv('data/new_paraphrase/train.csv')

In [None]:
if not os.path.exists('data/new_paraphrase'):
    os.makedirs('data/new_paraphrase')
    
train_new.to_csv('data/new_paraphrase/train.csv', index=False)
# NOTE: dev and test reuse the first 100 training rows, so their metrics are only a sanity check
train_new[:100].to_csv('data/new_paraphrase/dev.csv', index=False)
train_new[:100].to_csv('data/new_paraphrase/test.csv', index=False)

In [None]:
para_dataset = ParaphraseDataset(path='data/new_paraphrase/')
train_iter, dev_iter, test_iter = para_dataset.create_iterators()

In [None]:
emb_dim = 128
hidden_dim = 64
num_layers = 1


attention_model = ModelHandler(
    embedding_encoder=EmbeddingLayer(emb_dim=emb_dim), 
    body_encoder=nn.Identity(), 
    head_encoder=AttentionHead(
        output_dim=emb_dim, 
        attention_mechanism=BilinearAttention
    )
)

attention_model.to(device)

In [None]:
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(attention_model.parameters(), lr=0.003)

train_evaluate(
    model_path='experiments/attention_model_new', 
    model=attention_model, 
    optimizer=optimizer, 
    train_iter=train_iter, 
    dev_iter=dev_iter, 
    num_epochs=6
)

## Now we repeat the retrieval with the new model

In [None]:
class_dataset = ParaphraseDataset('data/classification/', is_classification=True, batch_sizes=(1, 1, 1))
train_iter, dev_iter, test_iter = class_dataset.create_iterators()

x_train_texts = []
x_train = []
y_train = []

for train_batch in tqdm(train_iter):
    mask = (train_batch.labels != max_lab + 1)
    if mask.sum() > 0:
        train_text = train_batch.text1  # [mask]
        train_labels = train_batch.labels  # [mask]

        vectors = attention_model.predict_hidden(train_text, aggregate=True)
        x_train_texts.append(train_text.cpu().numpy())
        x_train.append(vectors.detach().cpu().numpy())
        y_train.append(train_labels.cpu().numpy())

x_train = np.vstack(x_train)
y_train = np.hstack(y_train)

x_test_texts = []
x_test = []
y_test = []

for test_batch in tqdm(test_iter):
    test_text = test_batch.text1
    test_labels = test_batch.labels
    
    vectors = attention_model.predict_hidden(test_text, aggregate=True)
    x_test_texts.append(test_text.cpu().numpy())
    x_test.append(vectors.detach().cpu().numpy())
    y_test.append(test_labels.cpu().numpy())
    
x_test = np.vstack(x_test)
y_test = np.hstack(y_test)

In [None]:
k = 20

emb_size = x_train.shape[1]
faiss_index = faiss.IndexFlat(emb_size)
faiss_index.verbose = True
faiss_index.add(x_train)

predicted_probs = []
predicted_labels = []

for i in tqdm(range(y_test.shape[0])):
    _, indexes = faiss_index.search(x_test[i].reshape(1, -1), k=k)
    train_texts = [x_train_texts[q] for q in indexes[0]]
    train_labels = y_train[indexes[0]]
    
    test_text = x_test_texts[i]
    test_label = y_test[i]
    
    test_probs = []
    for j in range(k):
        train_text = train_texts[j]
        logits, _ = attention_model(
            torch.from_numpy(train_text).to(device),
            torch.from_numpy(test_text).to(device)
        )
        probs = torch.sigmoid(logits).detach().cpu().numpy()
        test_probs.append(probs[0][0])
        
    test_probs = np.array(test_probs)
    max_prob_idx = np.argmax(test_probs)
    max_prob = test_probs[max_prob_idx]
    pred_label = train_labels[max_prob_idx]
    
    predicted_probs.append(max_prob)
    predicted_labels.append(pred_label)
    
predicted_probs = np.array(predicted_probs)
predicted_labels = np.array(predicted_labels)

best_thres = None
best_f1 = 0
for thres in np.linspace(0, 1):

    y_pred = predicted_labels.copy()
    # low-confidence matches fall back to the "no paraphrase" class
    y_pred[predicted_probs < thres] = max_lab + 1

    f1 = f1_score(y_true=y_test, y_pred=y_pred, average='macro')
    if f1 > best_f1:
        best_f1 = f1
        best_thres = thres

print(f'F1 score = {best_f1}')

# How to improve?

* train with a triplet loss
* collect more data
* use a more complex architecture
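
Of these ideas, the triplet loss is the easiest to write down: it pushes an anchor question closer to a paraphrase (positive) than to a non-paraphrase (negative) by at least a margin, `max(0, d(a, p) - d(a, n) + margin)`. A minimal NumPy sketch, assuming Euclidean distance and a margin of 1.0 (both illustrative choices, not settings from this tutorial):

```python
import numpy as np


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Mean hinge loss over a batch of (anchor, positive, negative) rows."""
    d_pos = np.linalg.norm(anchor - positive, axis=1)
    d_neg = np.linalg.norm(anchor - negative, axis=1)
    return float(np.maximum(0.0, d_pos - d_neg + margin).mean())


anchor = np.array([[0.0, 0.0], [0.0, 0.0]])
positive = np.array([[0.0, 1.0], [0.0, 1.0]])
negative = np.array([[3.0, 4.0], [0.0, 1.5]])  # far negative, then a hard negative
loss = triplet_loss(anchor, positive, negative)
print(loss)
```

Only the second triplet contributes here, because its negative is barely farther away than the positive. For training in PyTorch, `torch.nn.TripletMarginLoss` implements the same idea.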

# References

* [Attention? Attention!](https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html)
* [Supervised Learning of Universal Sentence Representations from Natural Language Inference Data](https://arxiv.org/pdf/1705.02364.pdf)
* [Multiway Attention Networks for Modeling Sentence Pairs](https://www.ijcai.org/proceedings/2018/0613.pdf)
* [The Annotated Transformer](http://nlp.seas.harvard.edu/2018/04/03/attention.html)
* [Neural Machine Translation by Jointly Learning to Align and Translate](https://arxiv.org/abs/1409.0473)
* [Attention and Augmented Recurrent Neural Networks](https://distill.pub/2016/augmented-rnns/)