# word2vec: How To Implement word2vec

### Explore Pre-trained Embeddings

Pre-trained embeddings available through `gensim.downloader` include:
- `glove-twitter-{25/50/100/200}`
- `glove-wiki-gigaword-{50/100/200/300}`
- `word2vec-google-news-300`
- `word2vec-ruscorpora-news-300`
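
The brace notation in the list above is shorthand: each number after the final hyphen is a separate model with that vector dimensionality. As a minimal sketch, the hypothetical helper `expand_model_names` (not part of gensim) expands the shorthand into concrete model names:

```python
import re

def expand_model_names(template):
    """Expand 'name-{a/b/c}' shorthand into a list of concrete model names."""
    match = re.search(r"\{([^}]*)\}", template)
    if match is None:
        return [template]  # no braces: already a concrete name
    options = match.group(1).split("/")
    # Substitute each option in place of the braced group
    return [template[:match.start()] + opt + template[match.end():] for opt in options]

names = expand_model_names("glove-twitter-{25/50/100/200}")
print(names)
# ['glove-twitter-25', 'glove-twitter-50', 'glove-twitter-100', 'glove-twitter-200']
```

Any one of these names can then be passed to `api.load(...)`, the same call used below for `glove-wiki-gigaword-100`.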

In [3]:
# Install gensim
!pip install gensim


In [4]:
# Load pretrained word vectors using gensim
import gensim.downloader as api

wiki_embeddings = api.load('glove-wiki-gigaword-100')


In [5]:
# Explore the word vector for "king"
wiki_embeddings['king']


In [4]:
# Find the words most similar to king based on the trained word vectors
wiki_embeddings.most_similar('king')

[('prince', 0.7682329416275024),
 ('queen', 0.7507690191268921),
 ('son', 0.7020887136459351),
 ('brother', 0.6985775232315063),
 ('monarch', 0.6977890729904175),
 ('throne', 0.6919990181922913),
 ('kingdom', 0.6811410188674927),
 ('father', 0.6802029013633728),
 ('emperor', 0.6712857484817505),
 ('ii', 0.6676074266433716)]
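
The scores returned by `most_similar` are cosine similarities between the query vector and every other word vector. The sketch below shows the idea in plain Python. The function names and the 3-dimensional `toy_vectors` are hypothetical illustrations, not real GloVe values and not gensim's implementation:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(word, vectors, topn=3):
    """Rank every other word by cosine similarity to `word`."""
    query = vectors[word]
    scores = [(other, cosine_similarity(query, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Hypothetical toy vectors, chosen only to illustrate the ranking
toy_vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.8, 0.9, 0.2],
    "prince": [0.9, 0.7, 0.2],
    "banana": [0.1, 0.0, 0.9],
}

ranked = most_similar("king", toy_vectors)
for word, score in ranked:
    print(word, round(score, 3))
```

With these toy values, `prince` and `queen` rank far above `banana`, because their vectors point in nearly the same direction as the one for `king`.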

### Train Our Own Model

In [5]:
# Read in the data and clean up column names
import gensim
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
pd.set_option('display.max_colwidth', 100)

messages = pd.read_csv('../../../data/spam.csv', encoding='latin-1')
messages = messages.drop(labels=["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"], axis=1)
messages.columns = ["label", "text"]
messages.head()

|   | label | text |
|---|-------|------|
| 0 | ham | Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there g... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives around here though |


In [6]:
# Clean the text with gensim's built-in preprocessor (lowercases, tokenizes, drops digits and very short or very long tokens)
messages['text_clean'] = messages['text'].apply(gensim.utils.simple_preprocess)
messages.head()

|   | label | text | text_clean |
|---|-------|------|------------|
| 0 | ham | Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there g... | [go, until, jurong, point, crazy, available, only, in, bugis, great, world, la, buffet, cine, th... |
| 1 | ham | Ok lar... Joking wif u oni... | [ok, lar, joking, wif, oni] |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ... | [free, entry, in, wkly, comp, to, win, fa, cup, final, tkts, st, may, text, fa, to, to, receive,... |
| 3 | ham | U dun say so early hor... U c already then say... | [dun, say, so, early, hor, already, then, say] |
| 4 | ham | Nah I don't think he goes to usf, he lives around here though | [nah, don, think, he, goes, to, usf, he, lives, around, here, though] |
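
The cleaned column shows what `simple_preprocess` does: single-character tokens such as `u` disappear, digits are stripped (`21st` becomes `st`), and everything is lowercased. The sketch below approximates that behavior in plain Python. The name `simple_preprocess_sketch` is hypothetical, and the regular expression and the length limits of 2 and 15 are assumptions about gensim's defaults, not its actual source:

```python
import re

# Runs of word characters that are not digits (an approximation of gensim's tokenizer)
TOKEN_PATTERN = re.compile(r"(?:(?!\d)\w)+")

def simple_preprocess_sketch(text, min_len=2, max_len=15):
    """Lowercase, tokenize, and drop tokens that are too short or too long."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    return [t for t in tokens
            if min_len <= len(t) <= max_len and not t.startswith("_")]

print(simple_preprocess_sketch("Ok lar... Joking wif u oni..."))
# ['ok', 'lar', 'joking', 'wif', 'oni']
```

For the sample messages shown above, this sketch reproduces the `text_clean` values, though it may differ from gensim on edge cases such as accented characters.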


In [7]:
# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(messages['text_clean'],
                                                    messages['label'], test_size=0.2)

In [8]:
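# Train the word2vec model
# Note: gensim >= 4.0 names the dimensionality argument `vector_size` (it was `size` in gensim 3.x)
w2v_model = gensim.models.Word2Vec(X_train,
                                   vector_size=100,  # length of each word vector
                                   window=5,         # context words considered on each side of a word
                                   min_count=2)      # ignore words that appear fewer than 2 times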
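
To make `window` and `min_count` concrete, here is a simplified sketch of how they shape the training data. The helpers `build_vocab` and `context_pairs` and the toy sentences are hypothetical. gensim's real training loop differs in its details (for example, it can randomly shrink the window and downsample frequent words), so treat this only as an illustration of the two parameters:

```python
from collections import Counter

def build_vocab(sentences, min_count=2):
    """Keep only tokens that occur at least `min_count` times in the corpus."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return {token for token, n in counts.items() if n >= min_count}

def context_pairs(sentence, vocab, window=5):
    """Pair each in-vocabulary word with its neighbors up to `window` positions away."""
    kept = [t for t in sentence if t in vocab]  # rare words are dropped first
    pairs = []
    for i, center in enumerate(kept):
        lo = max(0, i - window)
        hi = min(len(kept), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, kept[j]))
    return pairs

# Hypothetical toy corpus of already-tokenized messages
toy_sentences = [["free", "entry", "to", "win"],
                 ["text", "to", "win", "free", "tkts"]]

vocab = build_vocab(toy_sentences, min_count=2)
pairs = context_pairs(toy_sentences[0], vocab, window=1)
print(sorted(vocab))  # ['free', 'to', 'win']
print(pairs)          # [('free', 'to'), ('to', 'free'), ('to', 'win'), ('win', 'to')]
```

Words seen only once (`entry`, `text`, `tkts`) never get a vector, which is why looking up a rare word in the trained model can raise a `KeyError`.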

In [9]:
# Explore the word vector for "king" based on our trained model
w2v_model.wv['king']

array([-0.05425125,  0.04536858, -0.09595686, -0.02699764,  0.11971134,
        0.10338762, -0.03818981, -0.01448795,  0.0171045 ,  0.00511975,
        0.01045221, -0.00677045, -0.12050592,  0.11097632, -0.04719375,
       -0.02802079,  0.01247429, -0.06322849,  0.06611794,  0.07224897,
       -0.02086301,  0.016499  ,  0.02015498,  0.00358362,  0.08886525,
       -0.099216  ,  0.06923407,  0.01566726, -0.05832795,  0.03870581,
       -0.02199215,  0.03693705, -0.00661952, -0.04715456,  0.07135164,
       -0.00723605,  0.02134361, -0.09508089, -0.00362955, -0.03568636,
        0.05925028, -0.01528659, -0.04217548,  0.01903476, -0.02175902,
       -0.08289368, -0.06005706, -0.02793312,  0.06268803,  0.06778472,
       -0.03594127,  0.11335944, -0.06159783, -0.0157827 , -0.03330815,
       -0.00814747, -0.08040741, -0.02449049, -0.02535428, -0.02809742,
        0.03898891, -0.03665545, -0.0125957 ,  0.04661012, -0.04162746,
       -0.04639079, -0.04960034, -0.07714609,  0.04107031, -0.09

In [10]:
# Find the most similar words to "king" based on word vectors from our trained model
w2v_model.wv.most_similar('king')

[('show', 0.9984022974967957),
 ('being', 0.9983983039855957),
 ('coming', 0.9983887672424316),
 ('working', 0.9983633756637573),
 ('watching', 0.9983620643615723),
 ('boy', 0.998355507850647),
 ('gonna', 0.9983476400375366),
 ('poly', 0.9983355402946472),
 ('how', 0.9983333945274353),
 ('friends', 0.9983316659927368)]
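
One quick way to compare the two `most_similar` results is to look at how far apart the top-10 scores are. The sketch below uses only the scores printed above. The variable names are illustrative, and the spread is a rough heuristic rather than a formal quality metric:

```python
# Top-10 similarity scores for "king" copied from the two outputs above
pretrained_scores = [0.7682329416275024, 0.7507690191268921, 0.7020887136459351,
                     0.6985775232315063, 0.6977890729904175, 0.6919990181922913,
                     0.6811410188674927, 0.6802029013633728, 0.6712857484817505,
                     0.6676074266433716]
own_model_scores = [0.9984022974967957, 0.9983983039855957, 0.9983887672424316,
                    0.9983633756637573, 0.9983620643615723, 0.998355507850647,
                    0.9983476400375366, 0.9983355402946472, 0.9983333945274353,
                    0.9983316659927368]

def spread(scores):
    """Difference between the highest and lowest score in the list."""
    return max(scores) - min(scores)

pretrained_spread = spread(pretrained_scores)
own_spread = spread(own_model_scores)
print(f"pretrained spread: {pretrained_spread:.4f}")
print(f"own model spread:  {own_spread:.6f}")
```

The pre-trained vectors separate the neighbors of "king" by about 0.1, while the model trained on the SMS data scores unrelated words such as "show" and "how" within less than 0.0001 of each other. This suggests, though it does not prove, that the small training corpus has not given the model enough signal to distinguish word meanings.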