## Seminar part 1: Fun with Word Embeddings

Today we're going to play with word embeddings: train our own little embedding, load one from the gensim model zoo, and use it to visualize text corpora.

This whole thing is going to happen on top of a dataset of Quora questions (`quora.txt`).

__Requirements:__  `pip install --upgrade nltk gensim bokeh` , but only if you're running locally.

In [1]:
#pip install --upgrade nltk gensim bokeh

In [2]:
# download the data:
#!wget https://www.dropbox.com/s/obaitrix9jyu84r/quora.txt?dl=1 -O ./quora.txt
# alternative download link: https://yadi.sk/i/BPQrUu1NaTduEw

In [1]:
import numpy as np

data = list(open("/home/iris/paulshab/ShadLab6/quora.txt",  "r", encoding="utf-8"))
data[50]

"What TV shows or books help you read people's body language?\n"

# __Tokenization:__
A typical first step for an NLP task is to split raw data into words.
The text we're working with is in raw format, with punctuation and smileys attached to some words, so a simple `str.split` won't do.

Let's use __`nltk`__ - a library that handles many NLP tasks like tokenization, stemming or part-of-speech tagging.
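
To see why a plain `str.split` falls short, here is a small standard-library sketch. It assumes that `WordPunctTokenizer` behaves like the regular expression `\w+|[^\w\s]+` (runs of word characters, or runs of punctuation), which is how nltk describes it; the variable names are only for illustration.

```python
import re

text = "What TV shows or books help you read people's body language?"

# Whitespace splitting leaves punctuation glued to the words.
naive_tokens = text.split()

# Assumed equivalent of WordPunctTokenizer: word runs or punctuation runs.
regex_tokens = re.findall(r"\w+|[^\w\s]+", text)

print(naive_tokens[-3:])   # ["people's", 'body', 'language?']
print(regex_tokens[-6:])   # ['people', "'", 's', 'body', 'language', '?']
```

With whitespace splitting, `language?` and `language` would end up as two different vocabulary entries, which is exactly what we want to avoid before training embeddings.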

**Tokenization Library**


In [2]:
import nltk 
from nltk import tokenize
from nltk.tokenize import WordPunctTokenizer

nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /home/iris/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /home/iris/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [3]:
tokenizer = WordPunctTokenizer()

print(tokenizer.tokenize(data[50]))

['What', 'TV', 'shows', 'or', 'books', 'help', 'you', 'read', 'people', "'", 's', 'body', 'language', '?']


### Task 1

In [4]:
# TASK: lowercase everything and extract tokens with tokenizer. 
# data_tok should be a list of lists of tokens for each line in data.

def tokenize_and_to_lower(lines):
    """Lowercase each line and split it into tokens."""
    return [tokenizer.tokenize(line.lower()) for line in lines]

data_tok = tokenize_and_to_lower(data)
data_tok[50]

['what',
 'tv',
 'shows',
 'or',
 'books',
 'help',
 'you',
 'read',
 'people',
 "'",
 's',
 'body',
 'language',
 '?']

In [5]:
assert all(isinstance(row, (list, tuple)) for row in data_tok), "please convert each line into a list of tokens (strings)"
assert all(all(isinstance(tok, str) for tok in row) for row in data_tok), "please convert each line into a list of tokens (strings)"
is_latin = lambda tok: all('a' <= x.lower() <= 'z' for x in tok)
assert all(map(lambda l: not is_latin(l) or l.islower(), map(' '.join, data_tok))), "please make sure to lowercase the data"

In [6]:
print([' '.join(row) for row in data_tok[:2]])

["can i get back with my ex even though she is pregnant with another guy ' s baby ?", 'what are some ways to overcome a fast food addiction ?']


# Word Vectors
As the saying goes, there's more than one way to train word embeddings. There's Word2Vec and GloVe with different objective functions. Then there's fastText, which uses character-level models to train word embeddings.

The choice is huge, so let's start someplace small: __gensim__ is another NLP library that features many vector-based models, including word2vec.

In [7]:
from gensim.models import Word2Vec

model = Word2Vec(data_tok, 
                 vector_size=32,  # embedding vector size
                 min_count=5,     # consider words that occurred at least 5 times
                 window=5).wv     # define context as a 5-word window around the target word
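
To make the `window` parameter concrete, here is a simplified pure-Python sketch of how (target, context) pairs can be read off a tokenized sentence. `context_pairs` is a hypothetical helper written for this illustration, not gensim's training loop; real implementations add details such as negative sampling and subsampling of frequent words.

```python
def context_pairs(tokens, window=5):
    """Yield (target, context) pairs within `window` positions of each target."""
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                yield target, tokens[j]

sentence = ["what", "are", "some", "ways", "to", "overcome",
            "a", "fast", "food", "addiction", "?"]
pairs = list(context_pairs(sentence, window=2))
print(pairs[:3])  # [('what', 'are'), ('what', 'some'), ('are', 'what')]
```

Words that keep showing up in each other's windows are pushed towards similar vectors, which is why neighbors in the embedding space tend to be related words.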

In [26]:
# now you can get word vectors !
model.get_vector('anything')

(100,)

In [58]:
# or query similar words directly. Go play with it!
model.most_similar('cosmos')

[('syfy', 0.6056286096572876),
 ('discovery', 0.6002330183982849),
 ('prometheus', 0.5881201028823853),
 ('kosmos', 0.5825245976448059),
 ('continuum', 0.579316258430481),
 ('atlas', 0.5557728409767151),
 ('arc', 0.5545325875282288),
 ('studios', 0.5519287586212158),
 ('sci-fi', 0.549038290977478),
 ('hobbit', 0.5487554669380188)]
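
The scores returned by `most_similar` are cosine similarities. Below is a minimal numpy sketch of that ranking on made-up 3-d vectors; `vocab`, `vectors` and `cosine_neighbors` are illustrative names, not part of gensim.

```python
import numpy as np

# Toy 3-d "embeddings"; the numbers are made up for illustration.
vocab = ["cat", "dog", "car"]
vectors = np.array([[1.0, 0.9, 0.0],
                    [0.9, 1.0, 0.1],
                    [0.0, 0.1, 1.0]])

def cosine_neighbors(query, topn=2):
    """Rank the other words by cosine similarity to `query`."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit[vocab.index(query)]
    order = np.argsort(-sims)  # most similar first
    return [(vocab[i], float(sims[i])) for i in order if vocab[i] != query][:topn]

neighbors = cosine_neighbors("cat")
print(neighbors)  # 'dog' comes first, 'car' second
```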

### Using pre-trained model

That took a while, huh? Now imagine training life-sized (100-300D) word embeddings on gigabytes of text: Wikipedia articles or Twitter posts. 

Thankfully, nowadays you can get a pre-trained word embedding model in 2 lines of code (no sms required, promise).

In [10]:
import gensim.downloader as api
model = api.load('glove-twitter-100')

In [11]:
model.most_similar(positive=["coder", "money"], negative=["brain"])

[('broker', 0.5820155739784241),
 ('bonuses', 0.5424473285675049),
 ('banker', 0.5385112762451172),
 ('designer', 0.5197198390960693),
 ('merchandising', 0.4964233338832855),
 ('treet', 0.4922019839286804),
 ('shopper', 0.4920562207698822),
 ('part-time', 0.4912828207015991),
 ('freelance', 0.4843311905860901),
 ('aupair', 0.4796452522277832)]
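
The `positive` / `negative` query is vector arithmetic followed by a nearest-neighbor search. Here is a simplified sketch on hand-made 2-d vectors; the `toy` dictionary and the `analogy` helper are assumptions for illustration, and gensim additionally unit-normalizes the input vectors before combining them.

```python
import numpy as np

# Hand-made 2-d vectors: axis 0 ~ "royalty", axis 1 ~ "gender". Purely illustrative.
toy = {
    "king":  np.array([1.0,  1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0,  1.0]),
    "woman": np.array([0.0, -1.0]),
    "apple": np.array([-1.0, 0.0]),
}

def analogy(positive, negative):
    """Return the word closest (by cosine) to sum(positive) - sum(negative)."""
    query = sum(toy[w] for w in positive) - sum(toy[w] for w in negative)
    exclude = set(positive) | set(negative)  # never return the query words themselves

    def cos(v):
        return float(v @ query / (np.linalg.norm(v) * np.linalg.norm(query)))

    return max((w for w in toy if w not in exclude), key=lambda w: cos(toy[w]))

answer = analogy(positive=["king", "woman"], negative=["man"])
print(answer)  # queen
```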

### Visualizing word vectors

One way to see if our vectors are any good is to plot them. Thing is, those vectors live in a 30D+ space and we humans are more used to 2-3D.

Luckily, we machine learners know about __dimensionality reduction__ methods.

Let's use that to plot the 1000 most frequent words.

In [12]:
# The keys of pre-trained vectors are usually stored in descending frequency order,
# so the first 1000 entries of index_to_key serve as the most frequent words.
words = list(model.index_to_key)[:1000]

print(words[::100])

['<user>', '_', 'please', 'apa', 'justin', 'text', 'hari', 'playing', 'once', 'sei']


In [13]:
print(len(words))

1000


### Task 2

In [14]:
# for each word, compute its vector with model
word_vectors = np.asarray([model.get_vector(word) for word in words])

In [15]:
assert isinstance(word_vectors, np.ndarray)
assert word_vectors.shape == (len(words), 100)
assert np.isfinite(word_vectors).all()

In [16]:
word_vectors[0].shape

(100,)

## Linear projection: PCA

The simplest linear dimensionality reduction method is __P__rincipal __C__omponent __A__nalysis.

In geometric terms, PCA tries to find axes along which most of the variance occurs. The "natural" axes, if you wish.

<img src="https://github.com/yandexdataschool/Practical_RL/raw/master/yet_another_week/_resource/pca_fish.png" style="width:30%">


Under the hood, it attempts to decompose object-feature matrix $X$ into two smaller matrices: $W$ and $\hat W$ minimizing _mean squared error_:

$$\|(X W) \hat{W} - X\|^2_2 \to_{W, \hat{W}} \min$$
- $X \in \mathbb{R}^{n \times m}$ - object matrix (**centered**);
- $W \in \mathbb{R}^{m \times d}$ - matrix of direct transformation;
- $\hat{W} \in \mathbb{R}^{d \times m}$ - matrix of reverse transformation;
- $n$ samples, $m$ original dimensions and $d$ target dimensions;
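
One well-known minimizer of this objective takes the columns of $W$ to be the top $d$ right singular vectors of the centered $X$, with $\hat W = W^T$. The numpy sketch below checks this on random toy data (the data and variable names are made up for illustration) by comparing the reconstruction error against a random orthonormal projection.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))  # correlated toy data
X = X - X.mean(axis=0)                                   # PCA assumes centered X

d = 2
# Rows of Vt are the principal directions, ordered by explained variance.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
W = Vt[:d].T        # (m, d) direct transformation
W_hat = W.T         # (d, m) reverse transformation

Z = X @ W                                   # low-dimensional codes
pca_error = np.sum((Z @ W_hat - X) ** 2)    # the objective from the formula above

# A random orthonormal projection should reconstruct no better.
Q, _ = np.linalg.qr(rng.normal(size=(5, d)))
random_error = np.sum(((X @ Q) @ Q.T - X) ** 2)
print(pca_error <= random_error)  # True
```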

In [17]:
from sklearn.decomposition import PCA

# map word vectors onto 2d plane with PCA. Use good old sklearn api (fit, transform)
# after that, normalize vectors to make sure they have zero mean and unit variance
#word_vectors_pca = # YOUR CODE

# and maybe MORE OF YOUR CODE here :)

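def normalize(P):
    """Standardize each column to zero mean and unit variance."""
    Q = P - P.mean(axis=0)    # recenter every column separately
    Q = Q / P.std(axis=0)     # rescale every column to unit variance
    return Q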

pca = PCA(n_components=2)
word_vectors_pca = pca.fit_transform(word_vectors)

word_vectors_pca = normalize(word_vectors_pca)
word_vectors_pca 

array([[ 0.38816625,  0.29139748],
       [ 0.3032961 ,  0.21081626],
       [ 0.4915255 ,  0.35504422],
       ...,
       [ 1.0563147 , -1.3328588 ],
       [-0.03258466,  0.28135958],
       [-0.83395237, -0.1277347 ]], dtype=float32)

In [18]:
assert word_vectors_pca.shape == (len(word_vectors), 2), "there must be a 2d vector for each word"
assert max(abs(word_vectors_pca.mean(0))) < 1e-5, "points must be zero-centered"
assert max(abs(1.0 - word_vectors_pca.std(0))) < 1e-2, "points must have unit variance"
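
The asserts above require every column to have zero mean and unit variance, so the statistics have to be computed per column (`axis=0`). A small numpy sketch on made-up points shows the difference from subtracting one global mean:

```python
import numpy as np

points = np.array([[1.0, 10.0],
                   [2.0, 20.0],
                   [3.0, 60.0]])

# Per-column statistics: axis=0 collapses the rows.
standardized = (points - points.mean(axis=0)) / points.std(axis=0)

# Subtracting a single global mean leaves the columns off-center.
off_center = (points - points.mean()) / points.std(axis=0)

print(np.allclose(standardized.mean(axis=0), 0.0))  # True
print(np.allclose(off_center.mean(axis=0), 0.0))    # False
```

PCA output happens to be centered already, so the difference only shows up with methods such as t-SNE whose output is not.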

#### Let's draw it!

In [19]:
import bokeh.models as bm, bokeh.plotting as pl
from bokeh.io import output_notebook
output_notebook()

def draw_vectors(x, y, radius=10, alpha=0.25, color='blue',
                 width=600, height=400, show=True, **kwargs):
    """ draws an interactive plot for data points with auxiliary info on hover """
    if isinstance(color, str): color = [color] * len(x)
    data_source = bm.ColumnDataSource({ 'x' : x, 'y' : y, 'color': color, **kwargs })

    fig = pl.figure(active_scroll='wheel_zoom', width=width, height=height)
    fig.scatter('x', 'y', size=radius, color='color', alpha=alpha, source=data_source)

    fig.add_tools(bm.HoverTool(tooltips=[(key, "@" + key) for key in kwargs.keys()]))
    if show: pl.show(fig)
    return fig

In [20]:
draw_vectors(word_vectors_pca[:, 0], word_vectors_pca[:, 1], token=words)

# hover a mouse over there and see if you can identify the clusters

### Visualizing neighbors with t-SNE

PCA is nice, but it's strictly linear and thus only able to capture the coarse, high-level structure of the data.

If we instead want to focus on keeping neighboring points near each other, we could use t-SNE, which is itself an embedding method. Here you can read __[more on t-SNE](https://distill.pub/2016/misread-tsne/)__.
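
One rough way to put a number on "keeping neighboring points near" is to count how many of each point's k nearest neighbors survive the projection. The helpers below are hypothetical (they are not part of sklearn) and use random toy data; in this notebook you could pass `word_vectors` and a 2-d projection of them instead.

```python
import numpy as np

def knn_indices(points, k):
    """Indices of the k nearest neighbors (Euclidean) of every point, excluding itself."""
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)
    return np.argsort(dists, axis=1)[:, :k]

def neighbor_overlap(high_dim, low_dim, k=5):
    """Average fraction of each point's k nearest neighbors kept by the projection."""
    high_nn = knn_indices(high_dim, k)
    low_nn = knn_indices(low_dim, k)
    shared = [len(set(h) & set(l)) for h, l in zip(high_nn, low_nn)]
    return float(np.mean(shared)) / k

rng = np.random.default_rng(0)
high = rng.normal(size=(100, 20))
identity_score = neighbor_overlap(high, high)                    # perfect by construction
random_score = neighbor_overlap(high, rng.normal(size=(100, 2)))  # unrelated 2-d points
print(identity_score, random_score)
```

A projection that respects local structure should score well above the unrelated baseline; comparing PCA and t-SNE this way is a sanity check, not a rigorous evaluation.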

In [21]:
from sklearn.manifold import TSNE

# map word vectors onto 2d plane with TSNE. hint: use verbose=100 to see what it's doing.
# normalize them just like with pca


word_tsne = TSNE(verbose=100).fit_transform(word_vectors)

word_tsne = normalize(word_tsne)

[t-SNE] Computing 91 nearest neighbors...
[t-SNE] Indexed 1000 samples in 0.006s...
[t-SNE] Computed neighbors for 1000 samples in 0.139s...
[t-SNE] Computed conditional probabilities for sample 1000 / 1000
[t-SNE] Mean sigma: 1.716133
[t-SNE] Computed conditional probabilities in 0.046s
[t-SNE] Iteration 50: error = 68.3700409, gradient norm = 0.3172764 (50 iterations in 64.124s)
[t-SNE] Iteration 100: error = 70.1028595, gradient norm = 0.2873822 (50 iterations in 100.244s)
[t-SNE] Iteration 150: error = 69.3876038, gradient norm = 0.2864398 (50 iterations in 120.520s)
[t-SNE] Iteration 200: error = 69.4906769, gradient norm = 0.2898864 (50 iterations in 109.343s)
[t-SNE] Iteration 250: error = 68.9809799, gradient norm = 0.3028231 (50 iterations in 87.106s)
[t-SNE] KL divergence after 250 iterations with early exaggeration: 68.980980
[t-SNE] Iteration 300: error = 1.1983070, gradient norm = 0.0026329 (50 iterations in 60.525s)
[t-SNE] Iteration 350: error = 1.1012346, gradient norm 

In [22]:
draw_vectors(word_tsne[:, 0], word_tsne[:, 1], color='green', token=words)

### Visualizing phrases

Word embeddings can also be used to represent short phrases. The simplest way is to take __an average__ (optionally weighted) of the vectors for all tokens in the phrase.

This trick is useful to identify what data you are working with: find if there are any outliers, clusters or other artefacts.

Let's try this new hammer on our data!
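
Before touching the real model, here is a toy sketch of the averaging idea, including optional per-token weights and the all-unknown fallback. `toy_vectors` and `average_embedding` are made-up names with made-up 3-d vectors; a real model would supply the embeddings.

```python
import numpy as np

# Toy vocabulary with made-up 3-d vectors.
toy_vectors = {
    "fast": np.array([1.0, 0.0, 0.0]),
    "food": np.array([0.0, 1.0, 0.0]),
}

def average_embedding(tokens, weights=None, dim=3):
    """Weighted mean of known token vectors; zeros if no token is known."""
    weights = weights or {}
    known = [tok for tok in tokens if tok in toy_vectors]  # skip out-of-vocabulary tokens
    if not known:
        return np.zeros(dim)
    vecs = np.stack([toy_vectors[tok] for tok in known])
    w = np.array([weights.get(tok, 1.0) for tok in known])  # default weight is 1
    return np.average(vecs, axis=0, weights=w)

uniform = average_embedding(["fast", "food", "addiction"])         # components 0.5, 0.5, 0.0
weighted = average_embedding(["fast", "food"], weights={"food": 3.0})  # components 0.25, 0.75, 0.0
empty = average_embedding(["addiction"])                            # all zeros
print(uniform, weighted, empty)
```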

### Task 3

In [52]:
def get_phrase_embedding(phrase):
    """
    Convert phrase to a vector by aggregating its word embeddings. See description above.
    """
    # 1. lowercase phrase
    # 2. tokenize phrase
    # 3. average word vectors for all words in tokenized phrase
    # skip words that are not in model's vocabulary
    # if all words are missing from vocabulary, return zeros
    
    tokens = tokenizer.tokenize(phrase.lower())
    
    # key_to_index is a dict, so each membership test is a hash lookup
    # instead of a scan over the whole index_to_key list
    known_vectors = [model.get_vector(tok) for tok in tokens if tok in model.key_to_index]
    
    if not known_vectors:
        return np.zeros([model.vector_size], dtype='float32')
    return np.mean(known_vectors, axis=0)
        
vector = get_phrase_embedding("I'm very sure. This never happened to me before...")
vector.shape

(100,)

In [53]:
vector = get_phrase_embedding("I'm very sure. This never happened to me before...")

assert np.allclose(vector[::10],
                   np.array([ 0.31807372, -0.02558171,  0.0933293 , -0.1002182 , -1.0278689 ,
                             -0.16621883,  0.05083408,  0.17989802,  1.3701859 ,  0.08655966],
                              dtype=np.float32))

### Task 4

In [54]:
# let's only consider ~1k phrases for a first run.
chosen_phrases = data[::len(data) // 1000]

# compute vectors for chosen phrases: one fixed-size vector per phrase
phrase_vectors = np.asarray([get_phrase_embedding(phrase) for phrase in chosen_phrases])

model.get_vector('hello').shape

(100,)

In [55]:
assert isinstance(phrase_vectors, np.ndarray) and np.isfinite(phrase_vectors).all()
assert phrase_vectors.shape == (len(chosen_phrases), model.vector_size)

In [None]:
# map vectors into 2d space with pca, tsne or your other method of choice
# don't forget to normalize

phrase_vectors_2d = TSNE(verbose=1000).fit_transform(phrase_vectors)

phrase_vectors_2d = (phrase_vectors_2d - phrase_vectors_2d.mean(axis=0)) / phrase_vectors_2d.std(axis=0)

In [None]:
draw_vectors(phrase_vectors_2d[:, 0], phrase_vectors_2d[:, 1],
             phrase=[phrase[:50] for phrase in chosen_phrases],
             radius=20,)