## Tutorial on Word Embeddings
- We will use the `pymagnitude` package to access different word vector models.
- `pymagnitude` provides fast, easy access to many word vector models (e.g. word2vec, GloVe, fastText, ELMo) through a single API.

## Steps
- First, install `pymagnitude` with `pip`.
- Then download the word vector model files you want from the Magnitude repository: **[Github/PlasticityAI/Magnitude](https://github.com/plasticityai/magnitude)**
- `pymagnitude` is a Python-only package. To use these word vectors from another language, read them from the raw text files available on each model's official site.
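
Reading the raw text files is straightforward in any language. The sketch below uses Python only to show the layout, assuming the GloVe-style format in which every line holds a word followed by its space-separated vector components. The function name `load_text_vectors` and the sample values are made up for illustration.

```python
import io


def load_text_vectors(lines):
    """Parse lines of the form `word v1 v2 ... vn` into a dict of lists."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        word, values = parts[0], parts[1:]
        vectors[word] = [float(v) for v in values]
    return vectors


# Tiny made-up sample in the assumed layout; a real file has one line per
# vocabulary word and far more components per line.
sample = io.StringIO(
    "cat 0.1 0.2 0.3\n"
    "dog 0.2 0.1 0.4\n"
)
toy_vectors = load_text_vectors(sample)
print(toy_vectors["cat"])  # [0.1, 0.2, 0.3]
```

For a file on disk, passing an open file handle (opened with `encoding="utf-8"`) instead of the `StringIO` object should work the same way. Other formats, such as word2vec's text format with its leading header line, would need a small adjustment.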

### Setup Environment

#### Choose Python 3 + GPU/CPU

<img src="https://i.stack.imgur.com/khwGc.png" width="400"></img>
<img src="https://i.stack.imgur.com/5iL6w.png" width="400"></img>

#### Install and Import `pymagnitude`

In [1]:
!pip install pymagnitude

Collecting pymagnitude
  Downloading https://files.pythonhosted.org/packages/0a/a3/b9a34d22ed8c0ed59b00ff55092129641cdfa09d82f9abdc5088051a5b0c/pymagnitude-0.1.120.tar.gz (5.4MB)
     |████████████████████████████████| 5.4MB 1.4MB/s 
Building wheels for collected packages: pymagnitude
  Building wheel for pymagnitude (setup.py) ... done
  Created wheel for pymagnitude: filename=pymagnitude-0.1.120-cp36-cp36m-linux_x86_64.whl size=135918206 sha256=96f1e50c35928e926fee3f55a7ccec3e9f90c9973fbf6fdc07c0d92699bda287
  Stored in directory: /root/.cache/pip/wheels/a2/c7/98/cb48b9db35f8d1a7827b764dc36c5515179dc116448a47c8a1
Successfully built pymagnitude
Installing collected packages: pymagnitude
Successfully installed pymagnitude-0.1.120


In [0]:
from pymagnitude import Magnitude

## GloVe

### Download magnitude file

In [3]:
# Downloading glove
!wget http://magnitude.plasticity.ai/glove/light/glove.6B.100d.magnitude

--2019-08-11 04:45:27--  http://magnitude.plasticity.ai/glove/light/glove.6B.100d.magnitude
Resolving magnitude.plasticity.ai (magnitude.plasticity.ai)... 52.216.169.234
Connecting to magnitude.plasticity.ai (magnitude.plasticity.ai)|52.216.169.234|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 193871872 (185M) [binary/octet-stream]
Saving to: ‘glove.6B.100d.magnitude’


2019-08-11 04:45:30 (55.4 MB/s) - ‘glove.6B.100d.magnitude’ saved [193871872/193871872]



### Loading a downloaded .magnitude file

In [0]:
glove_vectors = Magnitude("glove.6B.100d.magnitude")

In [5]:
print("Number of words in glove's vocab:", len(glove_vectors))

Number of words in glove's vocab: 400000


In [6]:
print('cat' in glove_vectors)

True


In [12]:
print('peoplekind' in glove_vectors)

False


### Querying word vectors

In [9]:
glove_vectors.query('cat')

array([ 0.0458157,  0.0561247,  0.1253741, -0.1178949, -0.1162836,
        0.1255229,  0.0484232, -0.0279959,  0.0120681, -0.1567276,
       -0.0577499,  0.0283511,  0.1434202,  0.0405372,  0.0279204,
        0.195973 ,  0.1042463,  0.0193391,  0.1750634,  0.1016427,
        0.0797806,  0.0420077, -0.0026013, -0.1421145,  0.1099097,
        0.227253 , -0.1747141, -0.0996484, -0.045272 ,  0.0047397,
        0.0212727,  0.0166171,  0.1091715,  0.1160455,  0.1504489,
        0.0906988, -0.0555651,  0.0500564,  0.1368538, -0.1209926,
        0.0388505,  0.0087728, -0.0617861, -0.136578 , -0.0450875,
        0.0916493, -0.1531199,  0.0202567,  0.1104038,  0.0133782,
       -0.1135213,  0.0470996,  0.0936039,  0.1642385, -0.0580694,
       -0.2663456, -0.0197005,  0.0558389,  0.0825588,  0.0210009,
        0.1234354,  0.1775955, -0.0465261,  0.1018967,  0.1972073,
        0.2350715, -0.0324727,  0.0409837,  0.1465556,  0.0477426,
       -0.1914406,  0.0267516, -0.0014384,  0.0655168, -0.0245

### Obtain vectors for sentence words
- A single API call returns the vectors for every word in a list.

In [0]:
s_vec = glove_vectors.query(["cats", "are", "very", "smart"])

In [28]:
s_vec.shape

(4, 100)

In [29]:
s_vec[0]  # vector for the first word, "cats"

array([ 0.0534746,  0.0658763,  0.1870054, -0.1878241, -0.1361039,
        0.1787706,  0.0368147,  0.0105236, -0.0734371, -0.1664719,
       -0.0395147, -0.0076599,  0.087623 ,  0.05196  , -0.0100253,
        0.1080831, -0.0112173,  0.106295 ,  0.0333455,  0.0879857,
        0.1233711,  0.0152736, -0.0354984, -0.1272288,  0.0567415,
        0.1129181, -0.1014143, -0.1334674, -0.1390061,  0.0345092,
       -0.0179415,  0.0141827,  0.0668635,  0.0697162,  0.111911 ,
       -0.0337717,  0.0091931,  0.0793069, -0.0256063, -0.0376869,
        0.0379446,  0.0084574, -0.1493006, -0.1214661,  0.0539345,
        0.079186 , -0.1207227,  0.0129748,  0.0047   , -0.0642646,
       -0.0621019,  0.017952 , -0.0511136,  0.1135683, -0.0300667,
       -0.1121112, -0.0498845, -0.0999791, -0.1068441, -0.0355618,
       -0.0082395,  0.1947247, -0.0898908,  0.0357125,  0.1992089,
        0.2702968, -0.0970293, -0.0141555,  0.1273795, -0.0308398,
       -0.1807668, -0.0352942,  0.0518114,  0.0710662, -0.1095

### Word Similarity

In [20]:
glove_vectors.similarity("storm", ["cat", "typhoon"])

[0.31261414, 0.7601956]
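
Scores like these are conventionally cosine similarities, that is, the dot product of the two vectors divided by the product of their lengths. Assuming that is what `similarity` computes here, the calculation can be sketched in plain Python. The three-dimensional vectors below are invented toy values, not real GloVe vectors.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Invented toy vectors: "storm" and "typhoon" point in similar directions,
# "cat" points elsewhere.
toy = {
    "storm": [0.9, 0.1, 0.0],
    "typhoon": [0.8, 0.2, 0.1],
    "cat": [0.0, 0.3, 0.9],
}

# One score per candidate, mirroring the list-valued call above.
sims = [cosine_similarity(toy["storm"], toy[w]) for w in ("cat", "typhoon")]
print(sims)
```

The score ranges from -1 to 1, and a vector compared with itself scores 1.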

### Find the most similar word

In [23]:
glove_vectors.most_similar_to_given("cat", ["dog", "pig", "laptop", "universe"])

'dog'
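
Conceptually, this kind of lookup scores every candidate against the target and returns the best one. A minimal sketch of that idea follows, assuming cosine similarity as the score. The helper name `pick_most_similar` and the toy vectors are hypothetical and say nothing about how `pymagnitude` implements the call internally.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def pick_most_similar(vectors, target, candidates):
    """Return the candidate whose vector is closest to the target's vector."""
    return max(candidates, key=lambda c: cosine_similarity(vectors[target], vectors[c]))


# Invented toy vectors for illustration only.
toy = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "pig": [0.5, 0.9, 0.4],
    "laptop": [0.1, 0.2, 0.9],
}
best = pick_most_similar(toy, "cat", ["dog", "pig", "laptop"])
print(best)  # dog
```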

### Important Note on OOV words
- OOV stands for out of vocabulary.
- Word vector models are trained on a finite text corpus, so their vocabulary is also finite.
- `pymagnitude` models handle OOV keys by assigning them random vectors. The assignment is deterministic: the same key always receives the same vector.
- `Medium` and `Large` magnitude models have more advanced OOV handling that places keys with similar spelling closer together in vector space.
- The examples below compare the two behaviours.

#### Light Version with no special OOV treatment

In [36]:
"uber" in glove_vectors

True

In [25]:
"uberification" in glove_vectors

False

In [39]:
glove_vectors.similarity("uber", "uberification")

[-0.012241118, -0.08139361]

#### Medium Version with advanced OOV treatment

In [27]:
# Downloading the medium GloVe model
!wget http://magnitude.plasticity.ai/glove/medium/glove.6B.100d.magnitude -O glove.6B.100d.magnitude.medium

--2019-08-11 05:05:07--  http://magnitude.plasticity.ai/glove/medium/glove.6B.100d.magnitude
Resolving magnitude.plasticity.ai (magnitude.plasticity.ai)... 52.216.200.82
Connecting to magnitude.plasticity.ai (magnitude.plasticity.ai)|52.216.200.82|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 301989888 (288M) [binary/octet-stream]
Saving to: ‘glove.6B.100d.magnitude.medium’


2019-08-11 05:05:15 (41.0 MB/s) - ‘glove.6B.100d.magnitude.medium’ saved [301989888/301989888]



In [0]:
glove_medium = Magnitude("glove.6B.100d.magnitude.medium")

In [32]:
"uberification" in glove_medium

False

In [33]:
glove_medium.similarity("uber", "uberification")

0.9184026091035166

In [47]:
# Use this with caution: it does not always work, as this misspelling shows.
glove_medium.similarity("ultimate", "ultimete")

0.04616803565754191
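
The two OOV behaviours in this section can be imitated without any model files. The sketch below is an illustration of the idea only, not Magnitude's actual algorithm: a "deterministic random" vector is drawn from a generator seeded by a hash of the key, and a spelling-aware vector averages such vectors over the key's character n-grams. All names, the n-gram range, and the dimensionality are assumptions made for this example.

```python
import hashlib
import math
import random

DIM = 1000  # assumed size; large so unrelated random vectors are nearly orthogonal


def seeded_vector(text, dim=DIM):
    """Pseudo-random vector that depends only on `text`."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def char_ngrams(word, n_min=3, n_max=5):
    """All character n-grams of `word` with n_min <= n <= n_max."""
    return [word[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(word) - n + 1)]


def spelling_aware_vector(word, dim=DIM):
    """Average of the seeded vectors of the word's character n-grams."""
    grams = char_ngrams(word) or [word]  # very short words fall back to the word itself
    vecs = [seeded_vector(g, dim) for g in grams]
    return [sum(column) / len(vecs) for column in zip(*vecs)]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


# Same key, same vector: random-looking but reproducible.
print(seeded_vector("uberification") == seeded_vector("uberification"))  # True

# Whole-key seeding ignores spelling; n-gram averaging rewards shared substrings.
plain = cosine(seeded_vector("uber"), seeded_vector("uberification"))
aware = cosine(spelling_aware_vector("uber"), spelling_aware_vector("uberification"))
print(plain, aware)
```

In this toy setup the first score stays near zero, while the second is clearly positive because both keys share the n-grams `ube`, `ber` and `uber`.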

## fastText
- Uses subword units (character n-grams).
- Can often produce good representations for rare words as well.
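
To make "subword units" concrete: fastText wraps each word in the boundary markers `<` and `>` and represents it through its character n-grams, with lengths 3 to 6 by default. The sketch below only extracts those n-grams. Whether this particular pretrained file uses the default range is an assumption, and the function name `subword_ngrams` is made up for illustration.

```python
def subword_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of the boundary-marked form `<word>`."""
    marked = "<" + word + ">"
    grams = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(marked) - n + 1):
            grams.add(marked[i:i + n])
    return grams


bat = subword_ngrams("bat")
batgirl = subword_ngrams("batgirl")

# Subwords the two words have in common.
shared = sorted(bat & batgirl)
print(shared)  # ['<ba', '<bat', 'bat']
```

Because a word's vector is built from the vectors of its n-grams, words that share subwords in this way can end up with related representations even when one of them is rare.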

In [53]:
# Downloading fastText
!wget http://magnitude.plasticity.ai/fasttext/light/wiki-news-300d-1M-subword.magnitude

--2019-08-11 05:19:56--  http://magnitude.plasticity.ai/fasttext/light/wiki-news-300d-1M-subword.magnitude
Resolving magnitude.plasticity.ai (magnitude.plasticity.ai)... 52.216.226.98
Connecting to magnitude.plasticity.ai (magnitude.plasticity.ai)|52.216.226.98|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1397719040 (1.3G) [binary/octet-stream]
Saving to: ‘wiki-news-300d-1M-subword.magnitude’


2019-08-11 05:20:23 (50.9 MB/s) - ‘wiki-news-300d-1M-subword.magnitude’ saved [1397719040/1397719040]



In [0]:
ft_vectors = Magnitude("wiki-news-300d-1M-subword.magnitude")

In [58]:
ft_vectors.similarity("bat", "batgirl")

0.5256843

In [60]:
# GloVe, by contrast, fails to capture this similarity
glove_vectors.similarity("bat", "batgirl")

-0.029738491

## Other word vector models

<img width="600" align="middle" src="https://i.stack.imgur.com/OUbwp.png"></img>

In [0]:
# Downloading word2vec
!wget http://magnitude.plasticity.ai/word2vec/light/GoogleNews-vectors-negative300.magnitude