# Pretrained Word Vectors

## Google Word2Vec

You can download Google's pretrained word vectors, trained on Google News data, from <a href="https://code.google.com/archive/p/word2vec/">this</a> link.

In [None]:
!gunzip  /content/GoogleNews-vectors-negative300.bin.gz

In [None]:
import gensim
import warnings
warnings.filterwarnings("ignore")

In [None]:
googlew2v_model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

In [10]:
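# load_word2vec_format returns a KeyedVectors object, so index it directly
# (the .wv attribute on KeyedVectors was removed in gensim 4)
googlew2v_model['movie']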

array([ 0.17480469, -0.10986328, -0.20019531,  0.26757812, -0.06396484,
        0.06689453,  0.07958984,  0.08398438,  0.12695312,  0.11621094,
        0.11523438, -0.13867188, -0.08203125, -0.00143433, -0.19824219,
        0.13574219, -0.03955078,  0.06933594, -0.2265625 , -0.20019531,
        0.03076172,  0.16015625, -0.04174805,  0.00427246,  0.09619141,
       -0.03320312,  0.02783203,  0.02124023,  0.13867188, -0.02075195,
       -0.31835938, -0.08837891, -0.23828125,  0.02490234,  0.06787109,
       -0.18066406,  0.27148438,  0.16210938,  0.04614258,  0.20410156,
        0.22949219, -0.03710938,  0.140625  ,  0.12890625, -0.22558594,
        0.03857422, -0.01300049,  0.00582886,  0.23144531,  0.1015625 ,
       -0.10351562, -0.10351562, -0.2578125 ,  0.16503906,  0.03686523,
       -0.32421875,  0.02893066, -0.11914062, -0.19238281,  0.00086594,
        0.06591797,  0.265625  , -0.15917969,  0.26171875, -0.18359375,
        0.13085938, -0.25      , -0.05541992,  0.27929688, -0.06

In [11]:
googlew2v_model.most_similar(positive="movie")

[('film', 0.8676770329475403),
 ('movies', 0.8013108968734741),
 ('films', 0.7363011837005615),
 ('moive', 0.6830361485481262),
 ('Movie', 0.6693680286407471),
 ('horror_flick', 0.6577848196029663),
 ('sequel', 0.657779335975647),
 ('Guy_Ritchie_Revolver', 0.650975227355957),
 ('romantic_comedy', 0.6413198709487915),
 ('flick', 0.6321909427642822)]
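
The scores in the list above are cosine similarities between the query vector and each candidate vector. As a minimal sketch of that ranking, the snippet below uses plain Python and made-up 3-dimensional vectors. The names `cosine_similarity`, `most_similar`, and `toy_vectors` are illustrative only and are not part of gensim.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(vectors, word, topn=10):
    """Rank every other word by cosine similarity to `word`."""
    query = vectors[word]
    scored = [
        (other, cosine_similarity(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Made-up toy vectors (assumption: not taken from any real model).
toy_vectors = {
    "movie": [0.9, 0.1, 0.0],
    "film": [0.8, 0.2, 0.1],
    "banana": [0.0, 0.1, 0.9],
}

ranking = most_similar(toy_vectors, "movie", topn=2)
print(ranking)
```

With real embeddings the same idea applies to 300-dimensional vectors; gensim does the computation with vectorized NumPy operations rather than Python loops.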

## GloVe Pretrained Embeddings
You can download the GloVe embeddings from [this](https://nlp.stanford.edu/projects/glove/) link. The GloVe text format differs slightly from the word2vec text format: a word2vec file starts with a header line giving the vocabulary size and the vector dimensionality, while a GloVe file has no header. We can convert the GloVe file to the word2vec format and then load it with gensim, as below.

In [16]:
!unzip "/content/glove.42B.300d.zip"

Archive:  /content/glove.42B.300d.zip
  inflating: glove.42B.300d.txt      


In [None]:
from gensim.scripts.glove2word2vec import glove2word2vec
glove2word2vec(glove_input_file="glove.42B.300d.txt", word2vec_output_file="w2vstyle_glove_vectors.txt")

glove_model = gensim.models.KeyedVectors.load_word2vec_format("w2vstyle_glove_vectors.txt", binary=False)
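
To make the format difference concrete, here is a rough sketch of what the conversion amounts to: count the rows and the dimensions, then write a `<vocab_size> <dimensions>` header followed by the unchanged rows. The function `add_word2vec_header` and the tiny toy file are hypothetical and only for illustration; this is not gensim's implementation.

```python
import os
import tempfile


def add_word2vec_header(glove_path, output_path):
    """Copy a GloVe text file, prepending '<vocab_size> <dimensions>'."""
    vocab_size = 0
    dimensions = 0
    # First pass: count the rows and read the dimensionality from row one.
    with open(glove_path, encoding="utf-8") as src:
        for line in src:
            if vocab_size == 0:
                dimensions = len(line.rstrip().split(" ")) - 1
            vocab_size += 1
    # Second pass: write the header, then the original rows unchanged.
    with open(glove_path, encoding="utf-8") as src, \
            open(output_path, "w", encoding="utf-8") as dst:
        dst.write(f"{vocab_size} {dimensions}\n")
        for line in src:
            dst.write(line)
    return vocab_size, dimensions


# Toy GloVe-style file with made-up values (assumption: illustration only).
workdir = tempfile.mkdtemp()
glove_file = os.path.join(workdir, "toy_glove.txt")
w2v_file = os.path.join(workdir, "toy_w2v.txt")
with open(glove_file, "w", encoding="utf-8") as f:
    f.write("movie 0.1 0.2 0.3\nfilm 0.2 0.1 0.4\n")

shape = add_word2vec_header(glove_file, w2v_file)
with open(w2v_file, encoding="utf-8") as f:
    header = f.readline().strip()
print(shape, header)
```

If you are on gensim 4.0 or later, it should also be possible to skip the conversion step and pass `no_header=True` to `KeyedVectors.load_word2vec_format` with the original GloVe file; check the documentation of your installed version before relying on this.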

In [18]:
# glove_model is already a KeyedVectors object, so no .wv is needed
glove_model['movie']

array([-4.2075e-01, -1.4467e-01,  1.0191e-01,  2.0241e-01, -1.4567e-01,
       -1.1941e-01, -2.4700e+00,  3.3624e-02, -1.2006e-01, -4.6816e-01,
        7.1301e-01, -5.9439e-02,  1.2095e+00,  9.6810e-01, -3.2963e-01,
        8.7278e-02, -3.7333e-01, -9.3263e-02,  1.0973e-01, -7.1390e-02,
       -2.2664e-01, -2.6468e-01, -1.3868e-01, -3.0817e-01, -4.3989e-01,
        1.9845e-01,  4.5981e-02, -6.0768e-02, -1.6476e-01,  2.2074e-01,
       -2.6332e-01,  5.8367e-01,  2.5667e-01,  5.6293e-01, -4.5794e-01,
        3.4421e-01,  2.3349e-01, -1.4443e-01, -6.8497e-01,  2.3049e-01,
       -2.3430e-01,  9.5162e-02, -1.2030e+00,  5.3600e-01, -8.0814e-02,
       -8.1808e-02,  6.0079e-02, -2.1561e-01, -4.7038e-01, -2.8741e-01,
       -1.4882e-01,  2.7626e-01,  5.2747e-02, -2.9150e-01, -2.5470e-02,
        3.3785e-01,  1.2429e-02,  3.5526e-01,  3.3341e-01,  6.5088e-01,
        1.5721e-01,  8.8008e-02,  9.7392e-01, -3.9372e-01,  1.5102e-01,
       -2.4143e-01,  5.8701e-01, -6.2534e-01,  1.4371e-01, -6.35

In [19]:
glove_model.most_similar(positive="movie")

[('movies', 0.8332912921905518),
 ('film', 0.7633395195007324),
 ('films', 0.7153980731964111),
 ('starring', 0.6549824476242065),
 ('dvd', 0.6517331600189209),
 ('flick', 0.6402507424354553),
 ('soundtrack', 0.6395875215530396),
 ('trailer', 0.6362124681472778),
 ('cinema', 0.6266158819198608),
 ('picture', 0.6238757371902466)]

In [None]:
del glove_model

## FastText Pretrained Embeddings
You can get the fastText word embeddings from [this](https://fasttext.cc/docs/en/crawl-vectors.html) link. Either the fastText Python API or gensim can load the model; this notebook uses gensim.

In [1]:
!wget https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.en.300.bin.gz

--2020-04-22 10:14:34--  https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.en.300.bin.gz
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 104.22.75.142, 104.22.74.142, 2606:4700:10::6816:4b8e, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|104.22.75.142|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4503593528 (4.2G) [application/octet-stream]
Saving to: ‘cc.en.300.bin.gz’


2020-04-22 10:21:00 (11.2 MB/s) - ‘cc.en.300.bin.gz’ saved [4503593528/4503593528]



In [None]:
!gunzip  /content/cc.en.300.bin.gz

In [None]:
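# FastText.load_fasttext_format is deprecated (and removed in gensim 4);
# load_facebook_model reads the same native fastText .bin file.
from gensim.models.fasttext import load_facebook_model
fasttext_model = load_facebook_model("/content/cc.en.300.bin")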

In [4]:
fasttext_model.wv['movie']

array([-5.81209641e-03,  9.73904878e-02,  1.04595488e-02,  5.26866540e-02,
       -3.82720456e-02, -3.02855000e-02,  4.67608534e-02, -1.06743231e-01,
       -2.80689672e-02,  7.72105828e-02,  2.16982178e-02, -1.52001321e-01,
        1.32803321e-01,  2.02166960e-02, -7.05633610e-02,  5.50531000e-02,
       -2.79935598e-02, -1.39033943e-01, -7.21509978e-02,  3.93662788e-02,
       -6.86735809e-02,  8.05060491e-02,  1.06284015e-01, -6.39171377e-02,
       -4.26706411e-02, -4.15812656e-02, -8.94204527e-02,  1.00058243e-02,
       -3.51400971e-02,  1.59504384e-01,  7.65532535e-03,  1.23118579e-01,
       -5.39162569e-02, -8.93288776e-02,  2.27064770e-02,  2.14390028e-02,
       -4.57377657e-02, -2.46021561e-02, -1.00255758e-01,  2.49300152e-04,
       -1.03573985e-02,  8.80190730e-02, -3.35501134e-02, -1.01518229e-01,
       -9.00789425e-02, -1.54590141e-03, -2.96002999e-03, -4.15874347e-02,
        5.23288697e-02,  1.68570846e-01,  7.45731890e-02,  5.10579571e-02,
       -1.25696901e-02,  

In [7]:
fasttext_model.wv.most_similar(positive="movie")

[('film', 0.7731738090515137),
 ('movies', 0.7638393640518188),
 ('movie.But', 0.7434740662574768),
 ('movie.The', 0.7382540702819824),
 ('movie.So', 0.7321995496749878),
 ('movie.Now', 0.7312403917312622),
 ('movie.This', 0.7194931507110596),
 ('movie.What', 0.7097904682159424),
 ('movie--', 0.709675669670105),
 ('movie.And', 0.7086042165756226)]
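
Most of the neighbours above are tokenization artifacts such as `movie.But`, where two words were joined by punctuation in the crawl data. One possible workaround, sketched below, is to request more neighbours than needed and keep only purely alphabetic tokens. The helper `keep_plain_words` is hypothetical, and the input list is copied from the output above.

```python
def keep_plain_words(neighbours):
    """Keep only (word, score) pairs whose word is purely alphabetic."""
    return [(word, score) for word, score in neighbours if word.isalpha()]


# Neighbours copied from the most_similar output above.
fasttext_neighbours = [
    ('film', 0.7731738090515137),
    ('movies', 0.7638393640518188),
    ('movie.But', 0.7434740662574768),
    ('movie.The', 0.7382540702819824),
    ('movie.So', 0.7321995496749878),
    ('movie.Now', 0.7312403917312622),
    ('movie.This', 0.7194931507110596),
    ('movie.What', 0.7097904682159424),
    ('movie--', 0.709675669670105),
    ('movie.And', 0.7086042165756226),
]

filtered = keep_plain_words(fasttext_neighbours)
print(filtered)
```

In practice you would pass a larger `topn` to `most_similar` first so that enough candidates survive the filter. Note that `str.isalpha` also rejects multi-word tokens such as `horror_flick`, so this filter is a judgment call rather than a general fix.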