## Practice: Dealing with Word Embeddings
##### Credits: embeddings visualization is based on the notebook by [YSDA NLP course](https://github.com/yandexdataschool/nlp_course)

Today we're gonna play with word embeddings: train our own little embedding, load one from the gensim model zoo, and use it to visualize text corpora.

This whole thing is gonna happen on top of a dataset of Quora questions.

__Requirements:__ if you're running locally, in the selected environment run the following command:

```pip install --upgrade nltk gensim bokeh umap-learn```

In [None]:
import string

import numpy as np
import umap
from IPython.display import clear_output
from matplotlib import pyplot as plt
from nltk.tokenize import WordPunctTokenizer

Download the data (alternative download link: https://yadi.sk/i/BPQrUu1NaTduEw)

In [None]:
import sys


if "google.colab" in sys.modules:
    !wget "https://www.dropbox.com/s/obaitrix9jyu84r/quora.txt?dl=1" -O ./quora.txt -nc

In [None]:
with open("./quora.txt", encoding="utf-8") as f:
    data = list(f)  # one question per line
data[50]

__Tokenization:__ a typical first step for an NLP task is to split raw data into words.
The text we're working with is in raw format, with punctuation and smileys attached to some words, so a simple `str.split` won't do.

Let's use __`nltk`__, a library that handles many NLP tasks like tokenization, stemming or part-of-speech tagging.
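
To see why plain `str.split` falls short, here is a small sketch on a made-up sentence. It assumes the regular expression `\w+|[^\w\s]+`, which is the pattern `WordPunctTokenizer` is built on; the helper name `word_punct_split` is ours and not part of `nltk`.

```python
import re

# Pattern behind nltk's WordPunctTokenizer: runs of word characters,
# or runs of characters that are neither word characters nor whitespace.
WORD_PUNCT = re.compile(r"\w+|[^\w\s]+")


def word_punct_split(text):
    """Hypothetical stand-in for WordPunctTokenizer().tokenize(text)."""
    return WORD_PUNCT.findall(text)


sample = "Why can't I sleep?! :("
naive_tokens = sample.split()  # punctuation stays glued to the words
regex_tokens = word_punct_split(sample)  # punctuation becomes separate tokens

print(naive_tokens)
print(regex_tokens)
```

Note that the regex also splits contractions like `can't` into three tokens, which is usually an acceptable price for clean word tokens.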

In [None]:
tokenizer = WordPunctTokenizer()

print(tokenizer.tokenize(data[50]))

In [None]:
# TASK: lowercase everything and extract tokens with tokenizer.
# data_tok should be a list of lists of tokens for each line in data.

data_tok = [tokenizer.tokenize(x.lower()) for x in data]

Let's peek at the result:

In [None]:
" ".join(data_tok[0])

Small check that everything is alright

In [None]:
assert all(
    isinstance(row, (list, tuple)) for row in data_tok
), "please convert each line into a list of tokens (strings)"
assert all(
    all(isinstance(tok, str) for tok in row) for row in data_tok
), "please convert each line into a list of tokens (strings)"
is_latin = lambda tok: all("a" <= x.lower() <= "z" for x in tok)  # noqa: E731
assert all(
    map(lambda l: not is_latin(l) or l.islower(), map(" ".join, data_tok))
), "please make sure to lowercase the data"

__Word vectors:__ as the saying goes, there's more than one way to train word embeddings. There's Word2Vec and GloVe with different objective functions. Then there's fastText, which uses character-level models to train word embeddings.

The choice is huge, so let's start someplace small: __gensim__ is another NLP library that features many vector-based models including word2vec.

In [None]:
from gensim.models import Word2Vec


model = Word2Vec(
    data_tok,
    vector_size=32,  # embedding vector size
    min_count=5,  # consider words that occurred at least 5 times
    window=5,  # define context as a 5-word window around the target word
).wv  # keep only the trained word vectors

In [None]:
# now you can get word vectors!
model.get_vector("cat")

In [None]:
# or query similar words directly. Go play with it!
model.most_similar("rat")

### Using pre-trained model

That took a while, huh? Now imagine training life-sized (100-300D) word embeddings on gigabytes of text: Wikipedia articles or Twitter posts.

Thankfully, nowadays you can get a pre-trained word embedding model in 2 lines of code (no sms required, promise).

In [None]:
import gensim.downloader as api


model = api.load("glove-twitter-25")

In [None]:
model.most_similar(positive=["stick"], negative=["tree"])

### Visualizing word vectors

One way to see if our vectors are any good is to plot them. Thing is, those vectors live in a 25D space (often much more), and we humans are more used to 2-3D.

Luckily, we machine learners know about __dimensionality reduction__ methods.

Let's use them to plot the 1000 most frequent words.

In [None]:
model.sort_by_descending_frequency()
words = list(model.key_to_index.keys())[:1000]

print(words[::100])

In [None]:
# for each word, compute its vector with model
word_vectors = np.asarray([model[x] for x in words])

In [None]:
assert isinstance(word_vectors, np.ndarray)
assert word_vectors.shape == (len(words), 25)
assert np.isfinite(word_vectors).all()

In [None]:
word_vectors.shape

#### Linear projection: PCA

The simplest linear dimensionality reduction method is __P__rincipal __C__omponent __A__nalysis.

In geometric terms, PCA tries to find axes along which most of the variance occurs. The "natural" axes, if you wish.

<img src="https://github.com/yandexdataschool/Practical_RL/raw/master/yet_another_week/_resource/pca_fish.png" style="width:30%">


Under the hood, it attempts to decompose the object-feature matrix $X$ into two smaller matrices, $W$ and $\hat W$, minimizing the _mean squared error_:

$$\|(X W) \hat{W} - X\|^2_2 \to_{W, \hat{W}} \min$$
- $X \in \mathbb{R}^{n \times m}$ - object matrix (**centered**);
- $W \in \mathbb{R}^{m \times d}$ - matrix of direct transformation;
- $\hat{W} \in \mathbb{R}^{d \times m}$ - matrix of reverse transformation;
- $n$ samples, $m$ original dimensions and $d$ target dimensions;
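
For intuition, one optimum of this objective can be written in closed form: the columns of $W$ are the top-$d$ right singular vectors of the centered $X$, and $\hat W = W^\top$. The sketch below checks this with plain NumPy on random data. The helper name `pca_svd` and the toy matrix are assumptions for illustration, not part of sklearn.

```python
import numpy as np


def pca_svd(X, d):
    """Return (W, W_hat) for a rank-d PCA of X; X is centered inside."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, sorted by explained variance.
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    W = Vt[:d].T  # (m, d): direct transformation
    W_hat = Vt[:d]  # (d, m): reverse transformation
    return W, W_hat


rng = np.random.default_rng(0)
# Toy data: 200 samples, 5 features with very different spreads.
X = rng.normal(size=(200, 5)) * np.array([5.0, 3.0, 1.0, 0.5, 0.1])
X_centered = X - X.mean(axis=0)

errors = []
for d in range(1, 6):
    W, W_hat = pca_svd(X, d)
    reconstruction = (X_centered @ W) @ W_hat
    errors.append(np.mean((reconstruction - X_centered) ** 2))

print(errors)  # mean squared reconstruction error for d = 1..5
```

The error drops as $d$ grows and vanishes (up to rounding) at $d = m$, where nothing is thrown away.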



In [None]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


pca = PCA(2)
scaler = StandardScaler()
# map word vectors onto 2d plane with PCA. Use good old sklearn api (fit, transform)
# after that, normalize vectors to make sure they have zero mean and unit variance
word_vectors_pca = pca.fit_transform(word_vectors)
word_vectors_pca = scaler.fit_transform(word_vectors_pca)
# and maybe MORE OF YOUR CODE here :)

In [None]:
assert word_vectors_pca.shape == (len(word_vectors), 2), "there must be a 2d vector for each word"
assert max(abs(word_vectors_pca.mean(0))) < 1e-5, "points must be zero-centered"
assert max(abs(1.0 - word_vectors_pca.std(0))) < 1e-2, "points must have unit variance"

#### Let's draw it!

In [None]:
import bokeh.models as bm
import bokeh.plotting as pl
from bokeh.io import output_notebook


output_notebook()


def draw_vectors(
    x, y, radius=10, alpha=0.25, color="blue", width=600, height=400, show=True, **kwargs
):
    """draws an interactive plot for data points with auxiliary info on hover"""
    if isinstance(color, str):
        color = [color] * len(x)
    data_source = bm.ColumnDataSource({"x": x, "y": y, "color": color, **kwargs})

    fig = pl.figure(active_scroll="wheel_zoom", width=width, height=height)
    fig.scatter("x", "y", size=radius, color="color", alpha=alpha, source=data_source)

    fig.add_tools(bm.HoverTool(tooltips=[(key, "@" + key) for key in kwargs.keys()]))
    if show:
        pl.show(fig)
    return fig

In [None]:
draw_vectors(word_vectors_pca[:, 0], word_vectors_pca[:, 1], token=words)

# hover a mouse over there and see if you can identify the clusters

### Visualizing neighbors with UMAP
PCA is nice but it's strictly linear and thus only able to capture coarse high-level structure of the data.

If we instead want to focus on keeping neighboring points near, we could use UMAP, which is itself an embedding method. Here you can read __[more on UMAP (ru)](https://habr.com/ru/company/newprolab/blog/350584/)__ and on __[t-SNE](https://distill.pub/2016/misread-tsne/)__, which is also an embedding.

In [None]:
embedding = umap.UMAP(n_neighbors=5).fit_transform(word_vectors)

In [None]:
draw_vectors(embedding[:, 0], embedding[:, 1], token=words)

# hover a mouse over there and see if you can identify the clusters

### Visualizing phrases

Word embeddings can also be used to represent short phrases. The simplest way is to take __an average__ of vectors for all tokens in the phrase with some weights.

This trick is useful for getting a feel for the data you are working with: finding out whether there are any outliers, clusters or other artefacts.
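
Here is a minimal sketch of the weighted variant on a toy vocabulary of 2D vectors. The vectors, the weights and the helper name `weighted_phrase_embedding` are all made up for illustration; with equal weights it reduces to the plain mean.

```python
import numpy as np

# Toy 2D "embeddings" and per-word weights (both invented for this example).
toy_vectors = {
    "how": np.array([1.0, 0.0]),
    "to": np.array([0.0, 1.0]),
    "cook": np.array([4.0, 4.0]),
}
toy_weights = {"how": 0.5, "to": 0.5, "cook": 2.0}


def weighted_phrase_embedding(tokens, vectors, weights, dim=2):
    """Weighted average of the vectors of known tokens; zeros if none are known."""
    known = [tok for tok in tokens if tok in vectors]
    if not known:
        return np.zeros(dim)
    w = np.array([weights.get(tok, 1.0) for tok in known])  # default weight is 1
    stacked = np.stack([vectors[tok] for tok in known])
    return (w[:, None] * stacked).sum(axis=0) / w.sum()


tokens = ["how", "to", "cook", "pasta"]  # "pasta" is out of vocabulary and is skipped
plain = weighted_phrase_embedding(tokens, toy_vectors, {})
weighted = weighted_phrase_embedding(tokens, toy_vectors, toy_weights)
print(plain, weighted)
```

Down-weighting frequent function words like "how" and "to" pulls the phrase vector towards the content word.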

Let's try this new hammer on our data!


In [None]:
def get_phrase_embedding(phrase):
    """
    Convert phrase to a vector by aggregating its word embeddings. See description above.
    """
    vector = np.zeros([model.vector_size], dtype="float32")
    phrase_tokenized = tokenizer.tokenize(phrase.lower())
    phrase_vectors = [model[x] for x in phrase_tokenized if model.has_index_for(x)]

    if len(phrase_vectors) != 0:
        vector = np.mean(phrase_vectors, axis=0)
    return vector

In [None]:
data[402687]

In [None]:
get_phrase_embedding(data[402687])

In [None]:
vector = get_phrase_embedding("I'm very sure. This never happened to me before...")

In [None]:
vector

In [None]:
# let's only consider ~1k phrases for a first run.
chosen_phrases = data[:: len(data) // 1000]

# compute vectors for chosen phrases and turn them to numpy array
phrase_vectors = np.asarray([get_phrase_embedding(x) for x in chosen_phrases])  # YOUR CODE

In [None]:
assert isinstance(phrase_vectors, np.ndarray) and np.isfinite(phrase_vectors).all()
assert phrase_vectors.shape == (len(chosen_phrases), model.vector_size)

In [None]:
# map vectors into 2d space with pca, tsne or your other method of choice
# don't forget to normalize

phrase_vectors_2d = umap.UMAP(n_neighbors=3).fit_transform(phrase_vectors)
# phrase_vectors_2d = pca.fit_transform(phrase_vectors)

# phrase_vectors_2d = scaler.fit_transform(phrase_vectors_2d)
# phrase_vectors_2d = (phrase_vectors_2d - phrase_vectors_2d.mean(axis=0)) / phrase_vectors_2d.std(axis=0)

In [None]:
draw_vectors(
    phrase_vectors_2d[:, 0],
    phrase_vectors_2d[:, 1],
    phrase=[phrase[:50] for phrase in chosen_phrases],
    radius=20,
)

Finally, let's build a simple "similar question" engine with phrase embeddings we've built.
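
The core of such an engine is a cosine-similarity top-k search. Here is a self-contained sketch on random vectors; the name `top_k_cosine` and the toy matrix are assumptions for illustration. It uses `np.argpartition` to avoid sorting the whole array, then sorts only the k winners.

```python
import numpy as np


def top_k_cosine(query, matrix, k):
    """Indices of the k rows of `matrix` most cosine-similar to `query`, best first."""
    eps = 1e-16  # keeps the division finite for all-zero vectors
    sims = matrix @ query / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(query) + eps)
    # argpartition puts the k largest similarities in the last k slots (unordered)...
    candidates = np.argpartition(sims, -k)[-k:]
    # ...so only those k need a proper sort.
    return candidates[np.argsort(sims[candidates])[::-1]]


rng = np.random.default_rng(0)
toy_matrix = rng.normal(size=(1000, 25))
toy_query = toy_matrix[42] * 3.0  # same direction as row 42, different length

nearest = top_k_cosine(toy_query, toy_matrix, k=5)
print(nearest)
```

Because cosine similarity ignores vector length, row 42 comes out on top even though the query is three times longer.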

In [None]:
# compute vector embedding for all lines in data
data_vectors = np.vstack([get_phrase_embedding(line) for line in data])

In [None]:
norms = np.linalg.norm(data_vectors, axis=1)

In [None]:
printable_set = set(string.printable)

In [None]:
data_subset = [x for x in data if set(x).issubset(printable_set)]

In [None]:
def find_nearest(query, k=10):
    """
    given text line (query), return k most similar lines from data, sorted from most to least similar
    similarity is measured as cosine between query and line embedding vectors
    uses global variables: data, data_vectors and norms. see also: np.argpartition, np.argsort
    """
    query_vector = get_phrase_embedding(query)
    # cosine similarity of the query to every line; 1e-16 guards against zero-norm lines
    similarities = data_vectors.dot(query_vector[:, None])[:, 0] / (
        (norms + 1e-16) * np.linalg.norm(query_vector)
    )
    # argsort is ascending: take the last k indices and reverse them
    nearest_elements = similarities.argsort(axis=0)[-k:][::-1]
    return [data[i] for i in nearest_elements]

In [None]:
results = find_nearest(query="How do I stay active?", k=5)

print("".join(results))

In [None]:
results = find_nearest(query="How do i enter the matrix?", k=10)

print("".join(results))

assert len(results) == 10 and isinstance(results[0], str)
assert results[0] == "How do I get to the dark web?\n"
assert results[3] == "What can I do to save the world?\n"

In [None]:
find_nearest(query="Why don't i ask a question myself?", k=10)

In [None]:
from sklearn.cluster import KMeans

In [None]:
kmeans = KMeans(3)

In [None]:
labels = kmeans.fit_predict(np.asarray(phrase_vectors))

In [None]:
_colors = ["red", "green", "blue"]

In [None]:
draw_vectors(
    phrase_vectors_2d[:, 0],
    phrase_vectors_2d[:, 1],
    color=[_colors[label] for label in labels],
    phrase=[phrase[:50] for phrase in chosen_phrases],
    radius=20,
)

In [None]:
plt.figure(figsize=(12, 10))
plt.scatter(phrase_vectors_2d[:, 0], phrase_vectors_2d[:, 1], c=labels.astype(float))

__Now what?__
* Try running t-SNE instead of UMAP (it takes a long time)
* Try running UMAP or t-SNE on all data, not just 1000 phrases
* See what other embeddings are there in the model zoo: `gensim.downloader.info()`
* Take a look at [FastText](https://github.com/facebookresearch/fastText) embeddings
* Optimize find_nearest with locality-sensitive hashing: use [nearpy](https://github.com/pixelogik/NearPy) or `sklearn.neighbors`.
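
As a taste of the last item, here is a sketch of one classic locality-sensitive hashing scheme for cosine similarity: random hyperplanes. This is neither NearPy's nor sklearn's API; all names (`hash_vectors`, `build_index`, `n_planes`) are assumptions for illustration. Vectors that point in similar directions tend to fall on the same side of random hyperplanes, so they tend to share a bucket, and a query only needs to be compared against its own bucket.

```python
from collections import defaultdict

import numpy as np


def hash_vectors(vectors, planes):
    """Map each vector to a tuple of bits: which side of each hyperplane it lies on."""
    bits = (vectors @ planes.T) > 0
    return [tuple(row) for row in bits.tolist()]


def build_index(vectors, planes):
    """Group row indices of `vectors` by their hash."""
    buckets = defaultdict(list)
    for i, key in enumerate(hash_vectors(vectors, planes)):
        buckets[key].append(i)
    return buckets


rng = np.random.default_rng(0)
n_planes = 8  # more planes -> smaller buckets: faster, but more neighbors get missed
toy_vectors = rng.normal(size=(2000, 25))
planes = rng.normal(size=(n_planes, 25))
index = build_index(toy_vectors, planes)

# Scaling a vector does not change its direction, so it lands in the same bucket.
query = toy_vectors[7] * 2.5
candidates = index[hash_vectors(query[None, :], planes)[0]]
print(len(candidates), "candidates instead of", len(toy_vectors))
```

A real setup would typically use several independent hash tables and rank the candidates by exact cosine similarity, since a single table can miss true neighbors.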




### Extra: your own word2vec

In [None]:
import itertools

import torch
import torch.autograd as autograd  # noqa: F401
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim  # noqa: F401
from torch.optim.lr_scheduler import ReduceLROnPlateau, StepLR  # noqa: F401

In [None]:
vocabulary = set(itertools.chain.from_iterable(data_tok))

word_to_index = None  # YOUR CODE HERE
index_to_word = None  # YOUR CODE HERE
word_counter = {word: 0 for word in word_to_index.keys()}

Generating context pairs:
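
Before running this on the full corpus, the logic can be checked on a toy example. The sketch assumes a symmetric window of `w` words on each side; the two-sentence corpus, the `toy_` names and the sorted-vocabulary indexing are choices made for this example only, and any consistent word-to-index mapping would do.

```python
import itertools

toy_corpus = [["how", "to", "learn", "python"], ["learn", "to", "cook"]]

# Sorting makes the mapping reproducible: a raw set has no stable order between runs.
toy_vocabulary = sorted(set(itertools.chain.from_iterable(toy_corpus)))
toy_word_to_index = {word: index for index, word in enumerate(toy_vocabulary)}
toy_index_to_word = {index: word for word, index in toy_word_to_index.items()}


def context_pairs(tokens, w):
    """Yield (target, context) token pairs with up to w neighbors on each side."""
    for i, target in enumerate(tokens):
        lo = max(0, i - w)
        hi = min(i + w + 1, len(tokens))  # +1 because range() excludes its end
        for j in range(lo, hi):
            if j != i:
                yield target, tokens[j]


toy_pairs = [
    (toy_word_to_index[target], toy_word_to_index[context])
    for text in toy_corpus
    for target, context in context_pairs(text, w=1)
]
print(toy_pairs)
```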

In [None]:
context_tuple_list = []
w = 4

for text in data_tok:
    for i, word in enumerate(text):
        first_context_word_index = max(0, i - w)
        last_context_word_index = min(i + w + 1, len(text))  # +1: range() excludes its end
        for j in range(first_context_word_index, last_context_word_index):
            if i != j:
                context_tuple_list.append((word_to_index[word], word_to_index[text[j]]))
                word_counter[word] += 1.0
print("There are {} pairs of target and context words".format(len(context_tuple_list)))

Casting everything to `torch.LongTensor`

In [None]:
data_torch = torch.tensor(context_tuple_list).type(torch.LongTensor)
X_torch = data_torch[:, 0]
y_torch = data_torch[:, 1]
del data_torch

In [None]:
class Word2VecModel(nn.Module):
    def __init__(self, embedding_size, vocab_size):
        super(Word2VecModel, self).__init__()
        # YOUR CODE HERE

    def forward(self, context_word):
        # YOUR CODE HERE
        pass

In [None]:
device = torch.device("cuda:0") if torch.cuda.is_available() else torch.device("cpu")

In [None]:
model = Word2VecModel(25, len(word_to_index)).to(device)

In [None]:
loss_func = nn.CrossEntropyLoss()
opt = torch.optim.Adam(model.parameters(), lr=0.01)
# reduce the learning rate when the loss plateaus
lr_scheduler = ReduceLROnPlateau(opt, patience=35)

In [None]:
loss_func(model(X_torch[:5].to(device)), y_torch[:5].to(device))

In [None]:
batch_size = 1024
n_iterations = 1000
local_train_loss_history = []

In [None]:
def plot_train_process(train_loss):
    fig, axes = plt.subplots(1, 1, figsize=(15, 5))

    axes.set_title("Loss")
    axes.plot(train_loss, label="train")
    axes.legend()
    plt.show()

In [None]:
for i in range(n_iterations):

    ix = np.random.randint(0, len(context_tuple_list), batch_size)
    x_batch = X_torch[ix].to(device)
    y_batch = y_torch[ix].to(device)

    # YOUR CODE HERE: predict log-probabilities or logits

    # YOUR CODE HERE: compute loss, just like before
    loss = None

    # YOUR CODE HERE: compute gradients

    # YOUR CODE HERE: Adam step

    # YOUR CODE HERE: clear gradients

    local_train_loss_history.append(loss.item())
    lr_scheduler.step(local_train_loss_history[-1])

    if i % 100 == 0:
        clear_output(True)
        plot_train_process(local_train_loss_history)

In [None]:
matrix = next(model.embeddings.parameters()).detach().cpu()

In [None]:
def get_closest(word, top_n):
    """return top_n words closest to `word` by cosine similarity, most similar first"""
    query = matrix[word_to_index[word]][None, :]

    similarity = F.cosine_similarity(matrix, query)
    # the word itself always comes first, with similarity 1
    nearest = torch.argsort(similarity, descending=True)[:top_n]
    return [index_to_word[i] for i in nearest.tolist()]

In [None]:
get_closest("apple", 5)

It might not look so promising. Remember the upgrades to word2vec: subsampling and negative sampling.
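
Both upgrades are driven by word frequencies, which is what `word_counter` above can supply. Here is a rough sketch on made-up counts, following the formulas from the original word2vec paper: one occurrence of a word with relative frequency $f$ is dropped with probability $1 - \sqrt{t / f}$, and negatives are drawn in proportion to $\text{count}^{0.75}$. The counts, the threshold `t` and the helper names are assumptions for illustration.

```python
import random

toy_counts = {"the": 900, "cat": 60, "sat": 30, "quantum": 10}
total = sum(toy_counts.values())


def discard_probability(word, t=0.05):
    """Chance of dropping one occurrence of `word`; t is a toy-sized threshold."""
    frequency = toy_counts[word] / total
    return max(0.0, 1.0 - (t / frequency) ** 0.5)


# Negative-sampling distribution: unigram counts raised to the power 0.75.
# The exponent flattens the distribution, so rare words get sampled more often
# than their raw frequency would suggest.
powered = {word: count**0.75 for word, count in toy_counts.items()}
norm = sum(powered.values())
negative_probs = {word: value / norm for word, value in powered.items()}

rng = random.Random(0)
negatives = rng.choices(list(negative_probs), weights=list(negative_probs.values()), k=5)

for word in toy_counts:
    print(word, round(discard_probability(word), 3), round(negative_probs[word], 3))
print(negatives)
```

On a real corpus the threshold is much smaller (the paper suggests values around $10^{-5}$), because individual word frequencies are much smaller too.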