In this notebook, let us see how we can represent text using pre-trained word embedding models.

## 1. Using a pre-trained word2vec model
Let us take a pre-trained word2vec model and use it to look up the words most similar to a given word. We will use the Google News vectors: https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM

A few other pre-trained word embedding models, along with details on how to access them through gensim, can be found at: https://github.com/RaRe-Technologies/gensim-data

In [1]:
import os
import wget
import gzip
import shutil

import warnings
warnings.filterwarnings("ignore")  # suppress warning messages in the notebook output

In [2]:
from gensim.models import KeyedVectors

# Load the pre-trained vectors, which are stored in the binary word2vec format
w2v_model = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin.gz", binary=True)

In [3]:
# Let us examine the model by looking up the most similar words for a given word.
w2v_model.most_similar('beautiful')

[('gorgeous', 0.8353003263473511),
 ('lovely', 0.8106936812400818),
 ('stunningly_beautiful', 0.7329413294792175),
 ('breathtakingly_beautiful', 0.7231340408325195),
 ('wonderful', 0.6854085922241211),
 ('fabulous', 0.6700063943862915),
 ('loveliest', 0.6612576842308044),
 ('prettiest', 0.6595001816749573),
 ('beatiful', 0.6593326330184937),
 ('magnificent', 0.6591402888298035)]
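
The number paired with each word is the cosine similarity between that word's vector and the vector of the query word. As a rough sketch of that computation, here is a pure-Python version on toy 2-dimensional vectors. The function name and the values are illustrative assumptions, not part of gensim or of the Google News model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors (made-up values, not taken from the Google News model)
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])  # close to 1
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])      # exactly 0
print(same_direction, orthogonal)
```

Scores close to 1 mean the two vectors point in nearly the same direction. With the real model loaded, `w2v_model.similarity('beautiful', 'gorgeous')` should return the same kind of score.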

In [5]:
#Let us try with another word! 
w2v_model.most_similar('toronto')

[('montreal', 0.6984111666679382),
 ('vancouver', 0.6587257385253906),
 ('nyc', 0.6248832941055298),
 ('alberta', 0.6179691553115845),
 ('boston', 0.611499547958374),
 ('calgary', 0.61032634973526),
 ('edmonton', 0.6100260615348816),
 ('canadian', 0.5944076180458069),
 ('chicago', 0.5911980867385864),
 ('springfield', 0.5888352394104004)]

In [6]:
#What is the vector representation for a word? 
w2v_model['computer']

array([ 1.07421875e-01, -2.01171875e-01,  1.23046875e-01,  2.11914062e-01,
       -9.13085938e-02,  2.16796875e-01, -1.31835938e-01,  8.30078125e-02,
        2.02148438e-01,  4.78515625e-02,  3.66210938e-02, -2.45361328e-02,
        2.39257812e-02, -1.60156250e-01, -2.61230469e-02,  9.71679688e-02,
       -6.34765625e-02,  1.84570312e-01,  1.70898438e-01, -1.63085938e-01,
       -1.09375000e-01,  1.49414062e-01, -4.65393066e-04,  9.61914062e-02,
        1.68945312e-01,  2.60925293e-03,  8.93554688e-02,  6.49414062e-02,
        3.56445312e-02, -6.93359375e-02, -1.46484375e-01, -1.21093750e-01,
       -2.27539062e-01,  2.45361328e-02, -1.24511719e-01, -3.18359375e-01,
       -2.20703125e-01,  1.30859375e-01,  3.66210938e-02, -3.63769531e-02,
       -1.13281250e-01,  1.95312500e-01,  9.76562500e-02,  1.26953125e-01,
        6.59179688e-02,  6.93359375e-02,  1.02539062e-02,  1.75781250e-01,
       -1.68945312e-01,  1.21307373e-03, -2.98828125e-01, -1.15234375e-01,
        5.66406250e-02, -

In [7]:
#What if I am looking for a word that is not in this vocabulary?
w2v_model['practicalnlp']

KeyError: "Key 'practicalnlp' not present"

### Two things to note while using pre-trained models:
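Lookups are exact, case-sensitive string matches, so a token must be spelled and cased exactly as it appears in the model's vocabulary. If a word is not in the vocabulary, the model raises a `KeyError`.
So, it is always a good idea to wrap such lookups in try/except blocks, or to check for membership first.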
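
The try/except advice can be sketched as a small helper. This is a minimal illustration, not gensim's own API: `get_vector` and `toy_model` are hypothetical names, and a plain dict stands in for `w2v_model` so the snippet runs without downloading the vectors. The helper relies only on `model[word]` raising `KeyError` for unknown words, which is the behaviour shown above.

```python
def get_vector(model, word, default=None):
    """Return the embedding for `word`, or `default` if it is out of vocabulary."""
    try:
        return model[word]
    except KeyError:
        return default

# Toy stand-in for w2v_model (made-up 3-dimensional vector)
toy_model = {"computer": [0.1, -0.2, 0.3]}

print(get_vector(toy_model, "computer"))      # the stored vector
print(get_vector(toy_model, "practicalnlp"))  # None instead of an exception
```

Returning a default keeps a processing loop running when it meets unknown words; the caller can then decide whether to skip the word or substitute something else.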

## 2. Getting the embedding representation for full text
We have seen how to get embedding vectors for single words. How do we use them to get such a representation for a full text? A simple way is to sum or average the embeddings of the individual words. We will see an example of this using Word2Vec in Chapter 4. For now, let us look at a small example using another NLP library, spaCy, which we also saw in Chapter 2.

In [30]:
!python -m spacy download en_core_web_md

Collecting en-core-web-md==3.5.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.5.0/en_core_web_md-3.5.0-py3-none-any.whl (42.8 MB)
     ---------------------------------------- 42.8/42.8 MB 6.5 MB/s eta 0:00:00
Installing collected packages: en-core-web-md
Successfully installed en-core-web-md-3.5.0
[+] Download and installation successful
You can now load the package via spacy.load('en_core_web_md')


In [32]:
import spacy

# import en_core_web_md
# nlp = en_core_web_md.load()

nlp = spacy.load('en_core_web_md')
# process a sentence using the model
mydoc = nlp("India is a large country")

In [34]:
#Get a vector for individual words
print(mydoc[0].vector) #vector for 'India', the first word in the text 

[-1.1011    1.7973    5.014     4.3489    7.229    -1.109     2.3511
 -3.0402    0.42372   0.25007   5.5072    1.9201   -2.14      1.542
  3.5712    1.4904    2.6017    2.1846   -0.88781  -4.9147    2.3158
  1.7719   -4.5053    0.87238   2.5795   -3.9663   -3.195    -5.4856
  0.58449  -3.633     6.5806   -2.814    -0.8018   -2.2385   -2.5967
 -0.36951   2.495     2.1553   -5.1234   -3.8011   -0.66563   3.3713
 -0.40796   1.6779   -0.81356  -1.0663    0.73857   2.109     0.96269
 -3.0482   -1.559     4.37     -0.61864  -4.5239   -1.7338    2.9717
 -0.65025  -5.0166   -0.90173   2.5771   -0.63705   0.073243  0.54366
  1.9573    2.8241    1.866    -4.2535    0.4187    2.3898   -1.1184
  2.0804    2.8429    2.1111    1.5752    0.98231   0.74594  -2.5984
  0.12858   3.4576    1.1289    2.7284    1.2478   -0.85998  -3.922
 -1.3507   -2.8447    0.5663   -6.858     0.84215   1.0103   -1.8259
  3.4112   -1.6      -3.5925   -1.9004    3.419     2.0865    1.4145
  5.3398   -0.69086   4.0109    1.

In [35]:
print(mydoc.vector) #Averaged vector for the entire sentence

[-1.85136580e+00  3.95195818e+00 -8.90240073e-01  2.26851988e+00
  6.53990030e+00 -4.44057852e-01  1.00946796e+00  4.53139973e+00
  1.75119996e+00  6.00323856e-01  1.07610798e+01  1.04354000e+00
 -2.13014007e+00  2.87600040e-01  9.28449988e-01  5.29252338e+00
  1.96775985e+00  3.73944020e+00  4.05126035e-01 -1.19980407e+00
  2.31595993e+00  8.48540008e-01 -4.63511944e+00  1.70241797e+00
 -8.08275998e-01 -2.36681986e+00 -1.37362003e+00 -5.79943991e+00
 -1.71975398e+00 -2.97378993e+00  1.20051599e+00  1.54037988e+00
 -1.85977209e+00 -2.67313004e+00 -2.95536375e+00  1.32889986e+00
  9.22280014e-01  8.59005928e-01  2.08222008e+00 -2.40376019e+00
 -1.07127595e+00  1.90140605e+00  1.21326804e+00  2.49010012e-01
 -2.32581210e+00  1.53779995e+00  2.89083576e+00 -1.61278379e+00
 -8.34007919e-01  1.38488007e+00 -1.87076414e+00  1.20948195e+00
  2.64277005e+00 -4.82089996e+00 -3.97879928e-01 -2.96300024e-01
  2.54758406e+00 -1.08623910e+00 -2.45025843e-01  3.01799953e-01
  2.26964617e+00 -6.01079
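
By default, spaCy's `Doc.vector` is the average of the token vectors in the document. The same idea can be sketched by hand for any word-to-vector mapping. The snippet below is an illustration under stated assumptions: `average_embedding` and `toy_model` are hypothetical names, the 2-dimensional vectors are made up, and words missing from the mapping are simply skipped, which is one possible choice rather than what any particular library does.

```python
def average_embedding(tokens, model, dim):
    """Average the vectors of in-vocabulary tokens; return zeros if none are found."""
    vectors = [model[token] for token in tokens if token in model]
    if not vectors:
        return [0.0] * dim
    # zip(*vectors) groups the values of each dimension together
    return [sum(column) / len(vectors) for column in zip(*vectors)]

# Toy stand-in for an embedding model (made-up 2-dimensional vectors)
toy_model = {"india": [1.0, 2.0], "large": [3.0, 4.0], "country": [5.0, 0.0]}

tokens = "india is a large country".split()
sentence_vector = average_embedding(tokens, toy_model, dim=2)
print(sentence_vector)  # [3.0, 2.0]
```

With real embeddings the vectors are NumPy arrays, so the same average is usually computed with `numpy.mean` over the stacked vectors.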

In [40]:
# What happens when I give a sentence with strange words (and stop words), and try to get its word vectors in spaCy?
temp = nlp('practicalnlp is a newword')
temp[0].vector

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0.

In [44]:
temp[2].vector  # vector for 'a', a stop word that is in the vocabulary

array([ -9.3629  ,   9.2761  ,  -7.2708  ,   4.3879  ,  10.316   ,
        -6.8469  ,   1.5755  ,   7.9405  ,   8.0812  ,   2.6194  ,
        17.189   ,   5.1028  ,  -3.5406  ,  -4.9522  ,   0.50726 ,
         7.3238  ,   8.4197  ,   3.4544  ,   0.83204 ,   5.5205  ,
         5.4937  ,   1.4897  ,  -2.2788  ,   4.497   ,   2.3909  ,
        -9.1051  ,  -6.827   ,  -3.8575  ,  -3.2794  ,  -6.6986  ,
         0.14048 ,  -2.2132  ,   3.5909  ,  -1.7824  ,  -6.5155  ,
         0.23331 ,   5.4186  , -11.212   ,  10.805   ,  -9.3444  ,
        -3.3625  ,  -1.3998  ,   3.5529  ,  -2.6246  ,   2.5553  ,
        -1.855   ,  -3.7859  ,   0.29584 ,  -2.5838  ,   1.6739  ,
        -1.6049  ,  -0.27709 ,   1.507   ,  -5.5291  ,  -2.1429  ,
        -1.7092  ,   8.389   ,  -1.856   ,  -5.4558  ,  -6.679   ,
         0.36212 ,   0.11176 ,   1.1457  ,  -3.2409  ,  -9.434   ,
         1.106   ,  -6.3912  , -13.735   ,   4.9788  ,   3.9198  ,
         0.031058,   4.3147  ,  -6.6471  ,   1.3955  ,  -2.595
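
As the outputs above show, spaCy returns an all-zero vector for a word it does not know, rather than raising an exception. A zero vector carries no information, so it can be useful to detect such tokens before combining vectors. The following is a minimal sketch on toy data: the function names and the 3-dimensional values are hypothetical, and the assumption that `newword` also has no vector is for illustration only. In spaCy itself, `token.has_vector` reports the same information.

```python
def is_zero_vector(vector):
    """True when every component is 0, as for the unknown token above."""
    return all(component == 0 for component in vector)

def tokens_without_vectors(tokens, vectors):
    """Return the tokens whose vector is all zeros."""
    return [tok for tok, vec in zip(tokens, vectors) if is_zero_vector(vec)]

# Toy data loosely modelled on the example above (made-up, 3 dimensions only)
tokens = ["practicalnlp", "is", "a", "newword"]
vectors = [[0.0, 0.0, 0.0], [1.2, -0.4, 0.7], [-9.4, 9.3, -7.3], [0.0, 0.0, 0.0]]

print(tokens_without_vectors(tokens, vectors))  # ['practicalnlp', 'newword']
```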