# word2vec: How To Implement word2vec

### Explore Pre-trained Embeddings

gensim's downloader provides several pre-trained embeddings, including:
- `glove-twitter-{25/50/100/200}`
- `glove-wiki-gigaword-{50/100/200/300}`
- `word2vec-google-news-300`
- `word2vec-ruscorpora-news-300`

In [1]:
# Install gensim
!pip install -U gensim

Collecting gensim
  Downloading gensim-4.3.2-cp311-cp311-macosx_10_9_x86_64.whl (24.1 MB)
Collecting smart-open>=1.8.1 (from gensim)
  Downloading smart_open-6.4.0-py3-none-any.whl (57 kB)
Installing collected packages: smart-open, gensim
Successfully installed gensim-4.3.2 smart-open-6.4.0

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.1[0m[39;49m -> [0m[32;49m24.0[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


In [2]:
# Load pretrained word vectors using gensim
import gensim.downloader as api

wiki_embeddings = api.load('glove-wiki-gigaword-100')

In [3]:
# Explore the word vector for "king"
wiki_embeddings['king']

array([-0.32307 , -0.87616 ,  0.21977 ,  0.25268 ,  0.22976 ,  0.7388  ,
       -0.37954 , -0.35307 , -0.84369 , -1.1113  , -0.30266 ,  0.33178 ,
       -0.25113 ,  0.30448 , -0.077491, -0.89815 ,  0.092496, -1.1407  ,
       -0.58324 ,  0.66869 , -0.23122 , -0.95855 ,  0.28262 , -0.078848,
        0.75315 ,  0.26584 ,  0.3422  , -0.33949 ,  0.95608 ,  0.065641,
        0.45747 ,  0.39835 ,  0.57965 ,  0.39267 , -0.21851 ,  0.58795 ,
       -0.55999 ,  0.63368 , -0.043983, -0.68731 , -0.37841 ,  0.38026 ,
        0.61641 , -0.88269 , -0.12346 , -0.37928 , -0.38318 ,  0.23868 ,
        0.6685  , -0.43321 , -0.11065 ,  0.081723,  1.1569  ,  0.78958 ,
       -0.21223 , -2.3211  , -0.67806 ,  0.44561 ,  0.65707 ,  0.1045  ,
        0.46217 ,  0.19912 ,  0.25802 ,  0.057194,  0.53443 , -0.43133 ,
       -0.34311 ,  0.59789 , -0.58417 ,  0.068995,  0.23944 , -0.85181 ,
        0.30379 , -0.34177 , -0.25746 , -0.031101, -0.16285 ,  0.45169 ,
       -0.91627 ,  0.64521 ,  0.73281 , -0.22752 , 

In [5]:
# Find the words most similar to "paris" based on the pre-trained word vectors
wiki_embeddings.most_similar('paris')

[('prohertrib', 0.7994136214256287),
 ('france', 0.7481586337089539),
 ('london', 0.7337678074836731),
 ('brussels', 0.7037920951843262),
 ('french', 0.6930579543113708),
 ('rome', 0.6879315972328186),
 ('amsterdam', 0.6758492588996887),
 ('vienna', 0.6608330607414246),
 ('berlin', 0.658585250377655),
 ('madrid', 0.6283904910087585)]
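
The scores returned by `most_similar` are cosine similarities between word vectors. As a rough illustration of that ranking, here is a minimal pure-Python sketch. The three-dimensional vectors in `toy_vectors` are invented for this example and are not taken from the GloVe model, and the helper names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, vectors, topn=3):
    """Rank every other word by cosine similarity to `word`."""
    query = vectors[word]
    scores = [(other, cosine_similarity(query, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Hypothetical 3-dimensional vectors, invented for illustration only.
toy_vectors = {
    "paris":  [0.9, 0.1, 0.3],
    "france": [0.8, 0.2, 0.4],
    "london": [0.7, 0.0, 0.6],
    "banana": [-0.2, 0.9, 0.1],
}

ranking = most_similar("paris", toy_vectors)
for other, score in ranking:
    print(f"{other}: {score:.3f}")
```

Words whose vectors point in nearly the same direction as the query score close to 1, while unrelated words score near 0 or below.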

### Train Our Own Model

In [7]:
# Read in the data and clean up column names
import gensim
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
pd.set_option('display.max_colwidth', 100)

messages = pd.read_csv('../../../data/spam.csv', encoding='latin-1')
messages = messages.drop(columns=["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"])
messages.columns = ["label", "text"]
messages.head()

| | label | text |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there g... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives around here though |


In [9]:
# Clean the text with gensim's built-in preprocessor: lowercase, strip punctuation
# and digits, drop tokens shorter than 2 or longer than 15 characters, and tokenize
# (stop words are kept, as the output below shows)
messages['text_clean'] = messages['text'].apply(gensim.utils.simple_preprocess)
messages.head()

| | label | text | text_clean |
|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there g... | [go, until, jurong, point, crazy, available, only, in, bugis, great, world, la, buffet, cine, th... |
| 1 | ham | Ok lar... Joking wif u oni... | [ok, lar, joking, wif, oni] |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ... | [free, entry, in, wkly, comp, to, win, fa, cup, final, tkts, st, may, text, fa, to, to, receive,... |
| 3 | ham | U dun say so early hor... U c already then say... | [dun, say, so, early, hor, already, then, say] |
| 4 | ham | Nah I don't think he goes to usf, he lives around here though | [nah, don, think, he, goes, to, usf, he, lives, around, here, though] |
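
To see why single-letter words such as "u" and digits such as "2005" vanish from `text_clean` while common words like "to" and "he" remain, here is a rough pure-Python approximation of that cleaning step. It is a sketch only: `rough_preprocess` and `TOKEN_PATTERN` are hypothetical names, and the regular expression and length limits are assumptions about gensim's default behaviour, not its actual source.

```python
import re

# Runs of word characters that are not digits. This approximates the pattern
# gensim's tokenizer uses (an assumption, not the library's code).
TOKEN_PATTERN = re.compile(r"(?:(?!\d)\w)+")


def rough_preprocess(text, min_len=2, max_len=15):
    """Lowercase, tokenize, and drop very short or very long tokens."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    return [t for t in tokens
            if min_len <= len(t) <= max_len and not t.startswith("_")]


print(rough_preprocess("Ok lar... Joking wif u oni..."))
print(rough_preprocess("Nah I don't think he goes to usf, he lives around here though"))
```

No stop-word list is involved at any point, so removing stop words would need a separate filtering step.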


In [10]:
# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(messages['text_clean'],
                                                    messages['label'], test_size=0.2)

In [15]:
# Train the word2vec model
# window=5: use up to 5 words on each side of the target word as context
# min_count=2: ignore words that appear fewer than 2 times in the training data
w2v_model = gensim.models.Word2Vec(X_train,
                                   window=5,
                                   min_count=2)
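
The effect of `min_count` and `window` can be illustrated without training anything. The sketch below is a simplification under stated assumptions: `build_vocab`, `context_pairs`, and `toy_corpus` are hypothetical names invented for this example, and the fixed-size window ignores details of gensim's training loop such as subsampling and per-word window shrinking.

```python
from collections import Counter


def build_vocab(sentences, min_count=2):
    """Keep only words that occur at least `min_count` times in the corpus."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word for word, n in counts.items() if n >= min_count}


def context_pairs(sentence, vocab, window=5):
    """Yield (center, context) pairs within `window` positions of each other."""
    kept = [w for w in sentence if w in vocab]  # rare words are dropped first
    for i, center in enumerate(kept):
        lo = max(0, i - window)
        hi = min(len(kept), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, kept[j]


# Hypothetical tokenized messages, invented for illustration only.
toy_corpus = [
    ["free", "entry", "to", "win", "cup", "tickets"],
    ["text", "to", "win", "free", "tickets"],
]

vocab = build_vocab(toy_corpus, min_count=2)
pairs = list(context_pairs(toy_corpus[0], vocab, window=1))
print(sorted(vocab))
print(pairs)
```

Words seen only once ("entry", "cup", "text") never enter the vocabulary, so they get no vector and cannot serve as context for other words.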

In [16]:
# Explore the word vector for "king" based on our trained model
w2v_model.wv['king']

array([-0.03049584,  0.05908654, -0.01221557,  0.00654785,  0.00782712,
       -0.11355954,  0.03960763,  0.16859058, -0.05505425, -0.07101228,
       -0.04122804, -0.11441729, -0.00984241,  0.04428684,  0.00133595,
       -0.03766664,  0.00509274, -0.05654792, -0.01683887, -0.12704638,
        0.05623287,  0.02873287,  0.02458217, -0.04171779, -0.00358975,
        0.01430789, -0.07868354, -0.04266369, -0.0532028 ,  0.01720881,
        0.09535459,  0.00251194,  0.04635129, -0.07555608, -0.01028094,
        0.07856404,  0.02075547, -0.07206255, -0.04006901, -0.13436067,
        0.02851605, -0.04240368, -0.01806438,  0.00380154,  0.05759073,
       -0.04526116, -0.03424266, -0.00098525,  0.0609719 ,  0.04108587,
        0.05556122, -0.09098596, -0.00822042,  0.01071998, -0.03291715,
        0.06189279,  0.06091283, -0.01190098, -0.06284454,  0.00166784,
       -0.00070281,  0.03505878,  0.01306088,  0.00475657, -0.08665266,
        0.07971847,  0.01450196,  0.07068511, -0.11706933,  0.08
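
Because `min_count=2` drops words that appear only once, indexing the trained vectors with a word outside the vocabulary raises a `KeyError`. A small guard avoids that. In the sketch below, `lookup_vector` is a hypothetical helper and a plain dict stands in for `w2v_model.wv`, which supports the same `in` and `[]` operations.

```python
def lookup_vector(keyed_vectors, word, default=None):
    """Return the vector for `word`, or `default` if it is out of vocabulary."""
    if word in keyed_vectors:
        return keyed_vectors[word]
    return default


# A plain dict stands in for `w2v_model.wv` here (assumption for illustration).
toy_wv = {"free": [0.1, 0.2], "win": [0.3, 0.4]}

print(lookup_vector(toy_wv, "free"))
print(lookup_vector(toy_wv, "jurong"))
```

With the real model, the call would look like `lookup_vector(w2v_model.wv, 'king')`.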

In [17]:
# Find the most similar words to "king" based on word vectors from our trained model
w2v_model.wv.most_similar('king')

[('another', 0.9953930974006653),
 ('best', 0.9953485131263733),
 ('done', 0.9953358769416809),
 ('of', 0.9953308701515198),
 ('night', 0.9953134059906006),
 ('much', 0.9953122735023499),
 ('and', 0.995309054851532),
 ('down', 0.9952840209007263),
 ('soon', 0.9952489137649536),
 ('watch', 0.995211124420166)]