# Thinking in tensors, writing in PyTorch

A hands-on course by [Piotr Migdał](https://p.migdal.pl) (2019). Version 0.2.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/stared/thinking-in-tensors-writing-in-pytorch/blob/master/extra/Word%20vectors.ipynb)


## Word vectors

### Reading

For general background reading, see:

* [king - man + woman is queen; but why?](https://p.migdal.pl/2017/01/06/king-man-woman-queen-why.html)
* [Word2vec in PyTorch](https://adoni.github.io/2017/11/08/word2vec-pytorch/)

### Notes

We use the smallest, 50-dimensional, uncased GloVe word embedding:

* [GloVe: Global Vectors for Word Representation by Stanford](https://nlp.stanford.edu/projects/glove/)

Other popular pre-trained word embeddings:

* [word2vec by Google](https://code.google.com/archive/p/word2vec/)
* [fastText by Facebook](https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md) (multilingual)

See also:

* [Aligning the fastText vectors of 78 languages](https://github.com/Babylonpartners/fastText_multilingual)
* [gensim-data](https://github.com/RaRe-Technologies/gensim-data) - data repository for pretrained NLP models and NLP corpora.

In [None]:
%matplotlib inline
import pandas as pd
import numpy as np
import seaborn as sns

from matplotlib import pyplot as plt

from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

## Loading data

In [None]:
# quoting=3 is csv.QUOTE_NONE, so quote characters in the file are kept as ordinary tokens
wv = pd.read_csv("./data/glove.6B.50d.txt",
                 delimiter=" ", header=None, index_col=0, quoting=3)

In [None]:
wv.head()

In [None]:
wv.loc["julia"].values

In [None]:
# let's make it nicer!
from IPython.display import Math

def latex_vector(series, first=3, last=1):
    """Render a word vector as LaTeX, showing only its first and last components."""
    if len(series) < first + last:
        raise ValueError("len(series) < first + last")

    s = r"\vec{v}_{\text{" + series.name + r"}} = ["

    vs_fmtd = ["{:.2f}".format(v) for v in series.values[:first]]
    if len(series) > first + last:
        vs_fmtd.append(r"\ldots")
    # slice from an explicit start so that last=0 yields no trailing components
    vs_fmtd += ["{:.2f}".format(v) for v in series.values[len(series) - last:]]

    s += ", ".join(vs_fmtd)
    s += "]"

    return Math(s)

In [None]:
latex_vector(wv.loc["julia"])

## Words close to each other

In [None]:
words = set(wv.index)

In [None]:
"daniel" in words

In [None]:
correlations = wv.loc[["cat", "dog", "bar", "pub", "beer", "tea", "coffee", "talked", "nicely"]].transpose().corr()
sns.clustermap(correlations, vmin=-1., vmax=1., cmap="coolwarm")

In [None]:
correlations = wv.loc[["hotel", "motel", "guesthouse", "bar", "pub", "party"]].transpose().corr()
sns.clustermap(correlations, vmin=-1., vmax=1., cmap="coolwarm")
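
The `.transpose().corr()` calls above compute the Pearson correlation between word vectors, treating the 50 dimensions as observations. This is the same as the cosine similarity of the vectors after subtracting each vector's own mean. A minimal sketch checking this on random toy data (the names `toy`, `w1`, `w2`, `w3` are made up for illustration and are not part of GloVe):

```python
import numpy as np
import pandas as pd

# Random stand-ins for three 50-dimensional word vectors (not real GloVe data).
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(3, 50)), index=["w1", "w2", "w3"])

# Pearson correlation between rows, computed as in the cells above.
corr = toy.transpose().corr()

# The same quantity by hand: center each row, scale it to unit length,
# then take all pairwise dot products.
centered = toy.sub(toy.mean(axis=1), axis="index")
unit = centered.div(np.sqrt((centered**2).sum(axis=1)), axis="index")
cosine = unit.dot(unit.transpose())

print(np.allclose(corr.values, cosine.values))
```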

## Projections on word differences

In [None]:
np.dot(wv.loc["kate"],  wv.loc["he"] - wv.loc["she"])

In [None]:
np.dot(wv.loc["john"],  wv.loc["he"] - wv.loc["she"])

In [None]:
names = ["kate", "catherine", "john", "mark", "peter", "anna", "julia", "jacob", "jake",
         "richard", "ted", "theodore", "sue", "susanne", "suzanne", "susan", "mary",
         "leo", "leonard", "alexander", "alexandra", "alex", "sasha"]
all([name in words for name in names])

In [None]:
gender = wv.loc["he"] - wv.loc["she"]

In [None]:
wv.loc[names].dot(gender).sort_values()

In [None]:
wv.loc[names].dot(gender).sort_values().plot.barh()

In [None]:
diminutive = wv.loc["kate"] - wv.loc["catherine"]

In [None]:
proj = pd.DataFrame([gender, diminutive], index=["gender", "diminutive"]).transpose()

In [None]:
df_plot = wv.loc[names].dot(proj).sort_values(by="diminutive")
df_plot

In [None]:
# let's normalize the data: divide each word vector by its Euclidean length
lens = np.sqrt((wv**2).sum(axis=1))
wvn = wv.div(lens, axis='index')
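
After this normalization every row has unit length, so a plain dot product between two rows is their cosine similarity. A minimal sketch on made-up three-dimensional vectors (the frame `toy` and its values are assumptions for illustration only):

```python
import numpy as np
import pandas as pd

# Made-up 3-dimensional "embeddings"; "c" points the same way as "a", "b" is orthogonal.
toy = pd.DataFrame(
    [[3.0, 4.0, 0.0], [0.0, 0.0, 2.0], [6.0, 8.0, 0.0]],
    index=["a", "b", "c"],
)

# Same recipe as above: divide each row by its Euclidean length.
toy_lens = np.sqrt((toy**2).sum(axis=1))
toy_n = toy.div(toy_lens, axis="index")

# Every row now has length 1.
unit_lengths = np.sqrt((toy_n**2).sum(axis=1))

# Dot products of unit vectors are cosine similarities.
cos_ac = toy_n.loc["a"].dot(toy_n.loc["c"])  # parallel vectors
cos_ab = toy_n.loc["a"].dot(toy_n.loc["b"])  # orthogonal vectors
```

Here `cos_ac` comes out as 1 even though `c` is twice as long as `a`, which is why the normalized vectors are used for similarity rankings.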

In [None]:
some_words = ["good", "bad", "ok", "not", "ugly", "beautiful", "awesome", "!", "?"]
# "awful" is used below as well, so check it too
assert all(word in words for word in some_words + ["awful"])

awesomeness = wvn.loc["awesome"] - wvn.loc["awful"]
wvn.loc[some_words].dot(awesomeness).sort_values()

## Plots

To reduce dimensions, we use:

* [PCA](http://setosa.io/ev/principal-component-analysis/) - Principal Component Analysis
* [t-SNE](https://lvdmaaten.github.io/tsne/) - t-Distributed Stochastic Neighbor Embedding

See also [How to Use t-SNE Effectively](https://distill.pub/2016/misread-tsne/) at Distill.
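
To give some intuition for what PCA does: it centers the data and projects it onto the directions of largest variance, which can be obtained from a singular value decomposition. A minimal NumPy sketch on made-up data (this is an illustration of the idea, not the code path scikit-learn necessarily uses internally):

```python
import numpy as np

# Made-up data: 100 points in 5 dimensions, stretched most along the first axis.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5)) * np.array([5.0, 1.0, 0.5, 0.2, 0.1])

# Center the data, then decompose it.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Rows of Vt are the principal directions; keep the first two.
X_2d = Xc @ Vt[:2].T

# Fraction of the total variance captured by each direction.
explained = S**2 / (S**2).sum()
print(X_2d.shape, explained[:2])
```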

In [None]:
pca = PCA(n_components=2)
X_pca = pca.fit_transform(wv.loc[names])

In [None]:
plt.plot(X_pca[:, 0], X_pca[:, 1], '.')
for i, name in enumerate(names):
    plt.annotate(name, X_pca[i])

In [None]:
tsne = TSNE(n_components=2, perplexity=3.)
X_tsne = tsne.fit_transform(wv.loc[names])

In [None]:
plt.plot(X_tsne[:, 0], X_tsne[:, 1], '.')
for i, name in enumerate(names):
    plt.annotate(name, X_tsne[i])

## Analogies

In [None]:
# without normalization the dot product grows with vector length, so compare with the normalized version below
wv.dot(wv.loc["dog"]).sort_values(ascending=False).head(10)

In [None]:
wvn.dot(wvn.loc["dog"]).sort_values(ascending=False).head(20)

In [None]:
wvn.dot(wvn.loc["dog"]).sort_values(ascending=False).tail(20)

In [None]:
wvn.dot(wvn.loc["king"] - wvn.loc["man"] + wvn.loc["woman"]).sort_values(ascending=False).head(20)

In [None]:
wvn.dot(wvn.loc["kissed"] - wvn.loc["kiss"] + wvn.loc["eat"]).sort_values(ascending=False).head(20)
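
In rankings like the ones above, the query words themselves often show up near the top. A minimal sketch of a helper that drops them, assuming a DataFrame of row-normalized vectors indexed by word, like `wvn` (the function `analogy` and the toy vectors are hypothetical and only serve as an illustration):

```python
import numpy as np
import pandas as pd

def analogy(vectors, a, b, c, topn=5):
    """Rank words by dot product with (a - b + c), skipping the query words.

    `vectors` is assumed to hold row-normalized word vectors indexed by word.
    """
    target = vectors.loc[a] - vectors.loc[b] + vectors.loc[c]
    scores = vectors.dot(target).drop([a, b, c])
    return scores.sort_values(ascending=False).head(topn)

# Made-up 2D vectors: first axis ~ "royalty", second axis ~ "gender".
toy = pd.DataFrame(
    [[1.0, 1.0], [1.0, -1.0], [0.0, 1.0], [0.0, -1.0], [-1.0, 0.0]],
    index=["king", "queen", "man", "woman", "peasant"],
)
toy = toy.div(np.sqrt((toy**2).sum(axis=1)), axis="index")

result = analogy(toy, "king", "man", "woman")
print(result)
```

With the real data, the call would look like `analogy(wvn, "king", "man", "woman")`.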

## Extremes

In [None]:
temp_diff = wvn.loc["hot"] - wvn.loc["cold"]
temp_avg = (wvn.loc["hot"] + wvn.loc["cold"]) / 2.
proj = pd.DataFrame([temp_diff, temp_avg], index=["temp_diff", "temp_avg"]).transpose()

In [None]:
temp_all = wvn.dot(proj).sort_values(by="temp_avg", ascending=False)

In [None]:
temp_all.head(20)

In [None]:
temp_all.head(20).sort_values(by="temp_diff", ascending=False)

## Other notes

* ['unk' in GloVe is not for UNKNOWN](https://stackoverflow.com/questions/49239941/what-is-unk-in-the-pretrained-glove-vector-files-e-g-glove-6b-50d-txt)