## Introduction

Word embeddings represent words as dense, multi-dimensional vectors. Each word <span class="math"><b>w<sub>i</sub></b></span> in a vocabulary **V** is mapped to a vector of **n** dimensions. These vectors are learned by unsupervised training on a large text corpus so that they capture the semantic similarities between words.
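
To make "semantic similarity" concrete, here is a minimal sketch that compares vectors with cosine similarity. The three-dimensional vectors are hypothetical values picked by hand for illustration; they do not come from any trained model, and `cosine_similarity` is a helper written for this example.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings (n = 3), not taken from a real vector file
toy_embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

sim_cat_dog = cosine_similarity(toy_embeddings["cat"], toy_embeddings["dog"])
sim_cat_car = cosine_similarity(toy_embeddings["cat"], toy_embeddings["car"])

# With these toy values, "cat" is closer to "dog" than to "car"
print("cat vs dog: {:.3f}".format(sim_cat_dog))
print("cat vs car: {:.3f}".format(sim_cat_car))
```

Real embeddings work the same way, only with hundreds of dimensions learned from data instead of three chosen by hand.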

### Downloading the required vector file

In [40]:
import sys

In [69]:
# # Make a directory to store the vector files
# !mkdir check

# # Now download the different files using these commands
# # This may take a while
# !cd check && curl -O https://github.com/panditu2015/panditu2015.github.io/blob/master/AL_Slides.pdf

## Static Word Embeddings
Static word embeddings assign one fixed vector to each word, so they lose contextual information: a word gets the same vector regardless of the sentence it appears in.
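
The loss of context can be shown with a small sketch. The lookup table and the `embed` helper below are hypothetical stand-ins for a trained static model; the point is only that each token is looked up independently of its neighbours.

```python
# Hypothetical lookup table standing in for a trained static embedding model
static_table = {
    "river": [0.1, 0.7],
    "money": [0.9, 0.2],
    "bank": [0.5, 0.5],
}

def embed(tokens, table):
    """Look up each token on its own; surrounding words are ignored."""
    return [table[token] for token in tokens]

# "bank" appears in two different contexts
bank_near_river = embed(["river", "bank"], static_table)[1]
bank_near_money = embed(["money", "bank"], static_table)[1]

# A static embedding returns the identical vector in both cases
print(bank_near_river == bank_near_money)
```

A contextual model would instead produce a different vector for "bank" in each sentence.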

### Stanford's GloVe

In [56]:
# Import some libraries

import pymagnitude as pym
import numpy as np
import matplotlib.pyplot as plt

In [62]:
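# Load the fastText vectors (wiki-news-300d-1M) in Magnitude format
vectors = pym.Magnitude("./vectors/wiki-news-300d-1M.magnitude")
# To use the GloVe vectors instead, load this file:
# vectors = pym.Magnitude("./vectors/glove.6B.50d.magnitude")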

In [64]:
print("Vector Name: {}\nTotal words: {}\nDimension: {}".format("fastText", len(vectors), vectors.dim))

Vector Name: fastText
Total words: 999994
Dimension: 300


In [66]:
# Iterate over (key, vector) pairs and inspect the entry at index 1000
for i, (key, vec) in enumerate(vectors):
    if i == 1000:
        # Print the vector's shape rather than its 300 values
        print("i = {}\nKey: {}\nVector: {}".format(i, key, vec.shape))
        break

i = 1000
Key: function
Vector: (300,)


In [67]:
vectors.query("cat")

array([-5.814220e-02,  9.683100e-03, -5.948800e-03,  1.273135e-01,
        1.775960e-02,  2.496770e-02,  1.333492e-01, -5.610130e-02,
        1.120290e-02,  2.735600e-03,  2.548880e-02, -2.540190e-02,
        1.502400e-02, -1.354770e-02, -7.520700e-02,  1.346100e-03,
        1.284425e-01, -3.386920e-02, -7.382000e-04, -6.135540e-02,
       -7.008320e-02, -5.102090e-02, -4.667870e-02,  1.389510e-02,
       -7.737810e-02, -8.554100e-03, -2.674800e-02,  7.403460e-02,
       -2.066890e-02,  3.282710e-02, -1.389500e-03, -6.044350e-02,
        1.016080e-02,  2.822430e-02, -1.971360e-02, -8.284930e-02,
        4.216280e-02,  2.323080e-02, -2.744280e-02,  1.671750e-02,
       -4.268390e-02,  1.075999e-01, -1.710830e-02, -3.686530e-02,
       -3.256660e-02,  4.298800e-03, -2.422950e-02, -7.177670e-02,
       -4.081680e-02,  6.991000e-03,  2.557560e-02, -3.994830e-02,
       -2.907108e-01, -7.711760e-02,  4.086020e-02, -4.146810e-02,
        1.104223e-01,  3.343500e-02, -8.771300e-03, -4.077330e

In [68]:
doc = vectors.query(["I", "read", "a", "book"])
doc.shape

(4, 300)

## Static Word Embeddings

### Word2Vec
### GloVe
### fastText

## Contextual Word Embeddings

### ELMo
### BERT

In [70]:
elmo_vecs = pym.Magnitude('./vectors/elmo_2x1024_128_2048cnn_1xhighway_weights.magnitude')

In [71]:
sentence = elmo_vecs.query(["play", "some", "music", "on", "the", "living", "room", "speakers", "."])
# Returns: an array of size (9 (number of words) x 768 (3 ELMo components concatenated))
unrolled = elmo_vecs.unroll(sentence)
# Returns: an array of size (3 (each ELMo component) x 9 x 256 (the number of dimensions for each ELMo component))

In [81]:
# unrolled.shape
# type(unrolled)
# Average over the 3 ELMo components to get one 256-d vector per token
mean_vecs = unrolled.mean(axis=0)

In [82]:
mean_vecs.shape

(9, 256)

In [83]:
sentence.shape

(9, 768)
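
The shapes above fit together as follows: the 768-dimensional query result holds three 256-dimensional ELMo components per token, `unroll` separates them, and the mean over the first axis averages them. Below is a rough NumPy sketch of that relationship using a random stand-in array instead of real ELMo output. It assumes the three components sit side by side along the last axis; that layout is an assumption made for illustration, not something confirmed by the Magnitude documentation.

```python
import numpy as np

n_tokens, n_components, dim = 9, 3, 256

# Random stand-in for `sentence`; real values would come from elmo_vecs.query(...)
rng = np.random.default_rng(0)
concatenated = rng.normal(size=(n_tokens, n_components * dim))

# Assumed layout: split the last axis into (component, dim),
# then move the component axis to the front
unrolled_like = concatenated.reshape(n_tokens, n_components, dim).transpose(1, 0, 2)

# Average the components to get one vector per token
mean_like = unrolled_like.mean(axis=0)

print(concatenated.shape, unrolled_like.shape, mean_like.shape)
```

Averaging is only one way to combine the components; keeping them separate or weighting them per task are alternatives.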