# word2vec: How To Implement word2vec

### Explore Pre-trained Embeddings

This section loads `glove-wiki-gigaword-100`. Other pre-trained models available through `gensim.downloader` include:
- `glove-twitter-{25/50/100/200}`
- `glove-wiki-gigaword-{50/200/300}`
- `word2vec-google-news-300`
- `word2vec-ruscorpora-news-300`

In [1]:
# Install gensim
!pip install -U gensim

Collecting gensim
  Downloading gensim-3.8.3-cp36-cp36m-manylinux1_x86_64.whl (24.2 MB)
Collecting smart-open>=1.8.1
  Downloading smart_open-4.2.0.tar.gz (119 kB)
Using legacy 'setup.py install' for smart-open, since package 'wheel' is not installed.
Installing collected packages: smart-open, gensim
    Running setup.py install for smart-open ... done
Successfully installed gensim-3.8.3 smart-open-4.2.0


In [2]:
# Load pretrained word vectors using gensim
import gensim.downloader as api

wiki_embeddings = api.load('glove-wiki-gigaword-100')



In [12]:
# Explore the word vector for "king": check its dimensionality
len(wiki_embeddings['king'])

100

In [8]:
# Find the words most similar to king based on the trained word vectors
wiki_embeddings.most_similar('king')

[('prince', 0.7682329416275024),
 ('queen', 0.7507690191268921),
 ('son', 0.7020887136459351),
 ('brother', 0.6985775232315063),
 ('monarch', 0.6977890729904175),
 ('throne', 0.6919990181922913),
 ('kingdom', 0.6811410188674927),
 ('father', 0.6802029013633728),
 ('emperor', 0.6712857484817505),
 ('ii', 0.6676074266433716)]
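
The scores returned by `most_similar` are cosine similarities between word vectors. As a minimal sketch of that idea, the block below ranks words by cosine similarity using plain Python. The three-dimensional `toy_vectors` and the helper names `cosine_similarity` and `toy_most_similar` are made up for illustration; they are not part of gensim and the numbers are not real embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy vectors, invented for this example only
toy_vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.8, 0.9, 0.2],
    "banana": [0.1, 0.0, 0.9],
}


def toy_most_similar(word, vectors, topn=2):
    """Rank every other word by cosine similarity to `word`, highest first."""
    scores = [
        (other, cosine_similarity(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


print(toy_most_similar("king", toy_vectors))
```

In this toy data "queen" ranks above "banana" because its vector points in nearly the same direction as the one for "king".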

### Train Our Own Model

In [9]:
# Read in the data and clean up column names
import gensim
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
pd.set_option('display.max_colwidth', 100)

messages = pd.read_csv('../../../data/spam.csv', encoding='latin-1')
messages = messages.drop(labels=["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"], axis=1)
messages.columns = ["label", "text"]
messages.head()

| | label | text |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there g... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives around here though |


In [10]:
# Clean data using gensim's built-in cleaner (lowercases, tokenizes, drops very short and very long tokens)
messages['text_clean'] = messages['text'].apply(gensim.utils.simple_preprocess)
messages.head()

| | label | text | text_clean |
|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there g... | [go, until, jurong, point, crazy, available, only, in, bugis, great, world, la, buffet, cine, th... |
| 1 | ham | Ok lar... Joking wif u oni... | [ok, lar, joking, wif, oni] |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ... | [free, entry, in, wkly, comp, to, win, fa, cup, final, tkts, st, may, text, fa, to, to, receive,... |
| 3 | ham | U dun say so early hor... U c already then say... | [dun, say, so, early, hor, already, then, say] |
| 4 | ham | Nah I don't think he goes to usf, he lives around here though | [nah, don, think, he, goes, to, usf, he, lives, around, here, though] |
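
The cleaned column shows what `simple_preprocess` does to each message: text is lowercased, split into alphabetic tokens (so digits and punctuation disappear), and tokens shorter than 2 or longer than 15 characters are dropped. The block below is a rough pure-Python approximation of that behavior, not gensim's actual implementation; the function name `approx_simple_preprocess` is hypothetical, and edge cases such as accent handling are ignored.

```python
import re


def approx_simple_preprocess(text, min_len=2, max_len=15):
    """Approximate gensim's simple_preprocess with a single regex pass."""
    # [^\W\d]+ matches runs of word characters that are not digits
    tokens = re.findall(r"[^\W\d]+", text.lower())
    return [
        t for t in tokens
        if min_len <= len(t) <= max_len and not t.startswith("_")
    ]


print(approx_simple_preprocess("Ok lar... Joking wif u oni..."))
print(approx_simple_preprocess("U dun say so early hor... U c already then say..."))
```

Single-letter words such as "u" and "c" fall below the minimum length, which is why they are missing from the cleaned output.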


In [21]:
# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(messages['text_clean'],
                                                    messages['label'], test_size=0.2)
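
Because no `random_state` is passed, `train_test_split` draws a different random 80/20 partition on every run, so the trained vectors will differ between runs. The block below is a minimal pure-Python sketch of a seeded shuffle-and-split to show the idea; the helper `seeded_split` and the seed value are assumptions for illustration, not scikit-learn's implementation.

```python
import random


def seeded_split(items, labels, test_size=0.2, seed=42):
    """Shuffle indices with a fixed seed, then cut off a test portion."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_test = int(len(indices) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    x_train = [items[i] for i in train_idx]
    x_test = [items[i] for i in test_idx]
    y_train = [labels[i] for i in train_idx]
    y_test = [labels[i] for i in test_idx]
    return x_train, x_test, y_train, y_test


# Made-up toy data: 10 "messages" with alternating labels
toy_items = [f"msg{i}" for i in range(10)]
toy_labels = ["spam" if i % 2 else "ham" for i in range(10)]

x_train, x_test, y_train, y_test = seeded_split(toy_items, toy_labels)
print(len(x_train), len(x_test))
```

With scikit-learn itself, passing `random_state` to `train_test_split` gives the same kind of repeatable split.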

In [22]:
# Train the word2vec model (gensim 3.8.x API; in gensim 4.x, `size` was renamed to `vector_size`)
w2v_model = gensim.models.Word2Vec(X_train, size=100, window=5, min_count=2)
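
Two of the parameters above shape what the model sees: `min_count=2` discards words that occur fewer than two times in the training data, and `window=5` limits context to at most five words on each side of the current word. The block below is a simplified sketch of both ideas on a made-up corpus. The helpers `build_vocab` and `context_windows` are hypothetical, and gensim's real training loop does more than this (for example, it can subsample frequent words and vary the effective window size).

```python
from collections import Counter


def build_vocab(sentences, min_count=2):
    """Keep only tokens that occur at least `min_count` times overall."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return {token for token, n in counts.items() if n >= min_count}


def context_windows(sentence, vocab, window=5):
    """Pair each in-vocabulary word with its neighbors within `window`."""
    kept = [t for t in sentence if t in vocab]  # rare words are dropped first
    pairs = []
    for i, center in enumerate(kept):
        lo = max(0, i - window)
        context = kept[lo:i] + kept[i + 1:i + 1 + window]
        pairs.append((center, context))
    return pairs


# Invented toy corpus of already-tokenized messages
toy_corpus = [
    ["free", "entry", "to", "win"],
    ["win", "free", "tickets"],
    ["ok", "free", "to", "go"],
]

vocab = build_vocab(toy_corpus, min_count=2)
print(sorted(vocab))
print(context_windows(toy_corpus[0], vocab, window=1))
```

One practical consequence: a word that appears fewer than `min_count` times in `X_train` has no vector, so looking it up in `w2v_model.wv` raises a `KeyError`.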

In [23]:
# Explore the word vector for "king" based on our trained model
w2v_model.wv['king']

array([-0.09881073,  0.04424203,  0.04464885,  0.05877058,  0.02030942,
        0.05098787, -0.06458431,  0.03381761,  0.05297993,  0.01770404,
        0.05162285,  0.03423791,  0.08659415, -0.02860742,  0.03120998,
        0.01943754, -0.13832872,  0.03191939, -0.01461859,  0.0377093 ,
       -0.0096233 , -0.1385841 ,  0.04168329, -0.01442445,  0.04132958,
       -0.01865339, -0.08268036,  0.01309933,  0.10666372,  0.05084528,
       -0.00620096, -0.01429884,  0.02090278, -0.0195729 , -0.04069761,
       -0.01470586,  0.14539431, -0.02956892,  0.01911633, -0.00418478,
        0.07950969, -0.02649336,  0.06436272,  0.04547657, -0.03769205,
        0.06383693,  0.00707263, -0.00078081,  0.02016441,  0.01986244,
        0.01470285,  0.03444396,  0.10523828,  0.06478006,  0.03796462,
        0.00111462, -0.0794683 ,  0.06404164,  0.0202042 , -0.02375861,
        0.01089656, -0.03225251, -0.04756789,  0.00282726,  0.03421168,
       -0.01600826,  0.00733423, -0.07555767, -0.02006   ,  0.00

In [24]:
# Find the most similar words to "king" based on word vectors from our trained model
w2v_model.wv.most_similar('king')

[('wont', 0.9987363815307617),
 ('cant', 0.9986922740936279),
 ('other', 0.9986587762832642),
 ('liao', 0.9986525774002075),
 ('took', 0.9986279010772705),
 ('else', 0.9986164569854736),
 ('down', 0.9986104965209961),
 ('thats', 0.9986076354980469),
 ('de', 0.9986043572425842),
 ('today', 0.9985916018486023)]