# Word Embeddings in Action - Word2Vec

Steps to follow:

1. Get data
2. Clean text data
3. Tokenization
4. Prepare vocabulary
5. Download pre-trained embeddings
6. Get word vectors

In [3]:
# import required libraries
import numpy as np
import re

# 1. Get Data

In [6]:
#input text
text=['Building some bots for Wikipedia.',
      'Wikipedia is flooded with information.',
      'There is an app for everthing.']

# 2. Text Cleaning

In [7]:
# cleaning
import re

def clean(text):
  #lower case
  text=text.lower()
  
  #replace every non-letter character (punctuation, digits) with a space
  text=re.sub(r'[^a-z]'," ",text)
  
  return text

In [8]:
#call the clean function
cleaned_text=[]

for i in text:
  cleaned_text.append(clean(i))
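
To see what `clean` does to a single sentence, here is a small self-contained check. It redefines the same function so the snippet runs on its own. The second sentence is a made-up example, not part of the data above, and is there only to show that digits are replaced along with punctuation.

```python
import re

def clean(text):
    # lower-case, then replace every non-letter character with a space
    text = text.lower()
    return re.sub('[^a-zA-Z]', ' ', text)

# first sentence from the input text: the final '.' becomes a trailing space
sample = clean('Building some bots for Wikipedia.')
print(repr(sample))

# hypothetical extra sentence: the digits disappear too
digits = clean('Top 10 apps!')
print(digits.split())
```

The extra spaces left behind are harmless here, because `split()` in the tokenization step collapses runs of whitespace.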

# 3. Tokenization

In [9]:
#tokenize the text
tokens=[]

for i in cleaned_text:
  tokens.append(i.split())

print(tokens)

[['building', 'some', 'bots', 'for', 'wikipedia'], ['wikipedia', 'is', 'flooded', 'with', 'information'], ['there', 'is', 'an', 'app', 'for', 'everthing']]


# 4. Vocabulary Preparation

In [10]:
#construct vocabulary
vocab=[]

for i in tokens:
  for j in i:
    if j not in vocab:
      vocab.append(j)

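#the loop above already skips duplicates, so this step is redundant;
#note that set() does not preserve the insertion order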
vocab = list(set(vocab))

print(vocab)

['an', 'flooded', 'everthing', 'for', 'bots', 'with', 'information', 'is', 'there', 'wikipedia', 'app', 'building', 'some']
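
If a stable word order matters (for example, to keep the rows of the feature matrix reproducible between runs), one possible alternative is to deduplicate with a dictionary instead of a set. This is only a sketch; `vocab_ordered` and `word_to_index` are names introduced here for illustration and are not used elsewhere in the notebook.

```python
tokens = [['building', 'some', 'bots', 'for', 'wikipedia'],
          ['wikipedia', 'is', 'flooded', 'with', 'information'],
          ['there', 'is', 'an', 'app', 'for', 'everthing']]

# dict keys are unique and keep insertion order (Python 3.7+),
# so each word appears once, in order of first occurrence
vocab_ordered = list(dict.fromkeys(word for sentence in tokens for word in sentence))

# map each word to its row index, handy when filling an embedding matrix
word_to_index = {word: idx for idx, word in enumerate(vocab_ordered)}

print(vocab_ordered)
print(len(vocab_ordered))
```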


# 5. Feature Representation (word2vec)

### Download Google's pre-trained Word2Vec


In [4]:
# download and extract word2vec embeddings 

#! wget -c "https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz"
#! gunzip GoogleNews-vectors-negative300.bin.gz

Since `wget` and `gunzip` are Linux utilities, this is the process I went through instead:

- Downloaded the file and copied it to D:\Large Filkes\Analytics_Vidhya
    - url = "https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz"
- Used 7-Zip to decompress it
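
The same two steps can also be done from Python with the standard library alone, which avoids depending on `wget`, `gunzip`, or 7-Zip. The sketch below is one way to do it, not the process actually used for this notebook; `download` and `gunzip` are helper names made up here, and the local file names in the usage comments are assumptions.

```python
import gzip
import shutil
import urllib.request

def download(url, dst_path):
    # stream the remote file to disk in chunks (the archive is large)
    with urllib.request.urlopen(url) as response, open(dst_path, 'wb') as out:
        shutil.copyfileobj(response, out)

def gunzip(src_path, dst_path):
    # decompress a .gz file in chunks without loading it all into memory
    with gzip.open(src_path, 'rb') as src, open(dst_path, 'wb') as out:
        shutil.copyfileobj(src, out)

# usage (not executed here; uncomment to run):
# url = "https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz"
# download(url, "GoogleNews-vectors-negative300.bin.gz")
# gunzip("GoogleNews-vectors-negative300.bin.gz", "GoogleNews-vectors-negative300.bin")
```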


In [None]:
# ! conda install -c conda-forge gensim

In [1]:
from gensim.models import KeyedVectors

# path of the downloaded model
# filename = 'GoogleNews-vectors-negative300.bin'
filename = "../../../../../LargeData/Analytics_Vidhya/GoogleNews-vectors-negative300.bin"

# load into gensim
w2vec = KeyedVectors.load_word2vec_format(filename, binary=True)

Once the code above has run, the pre-trained word2vec embeddings are loaded into memory and ready to use.




Note that every vector in the pre-trained word2vec embeddings has 300 dimensions.


In [14]:
# empty array of shape (no. of vocabulary words (13) x 300) to store word2vec features
wordvec_array = np.zeros((len(vocab), 300))

for i,j in enumerate(vocab):
  # look up the 300-dimensional vector for word j (word_vec() is deprecated in gensim 4.x)
  wordvec_array[i,:] = w2vec[j]

In [15]:
wordvec_array

array([[ 0.12597656,  0.19042969,  0.06982422, ...,  0.0612793 ,
         0.17285156, -0.07861328],
       [ 0.21679688, -0.03344727,  0.046875  , ..., -0.11914062,
        -0.09521484,  0.02612305],
       [ 0.22460938, -0.06542969, -0.08544922, ..., -0.14257812,
         0.04394531,  0.03271484],
       ...,
       [ 0.11572266, -0.29101562, -0.30664062, ..., -0.24609375,
        -0.17773438,  0.16113281],
       [-0.00976562,  0.02856445,  0.05419922, ..., -0.01300049,
         0.11621094,  0.02819824],
       [ 0.17871094,  0.09130859, -0.00165558, ...,  0.125     ,
         0.08056641,  0.01672363]])
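
The lookup loop above assumes that every vocabulary word exists in the pre-trained model; a word that is missing raises a `KeyError`. The sketch below shows one way to guard against that. To keep it runnable without the large model file, it uses `toy_vectors`, a made-up dictionary of 3-dimensional vectors standing in for `w2vec`, and it pretends that `'everthing'` is missing purely for illustration. With the real model, the same `word in w2vec` membership test can be used.

```python
import numpy as np

# hypothetical stand-in for w2vec: made-up 3-dimensional vectors
toy_vectors = {
    'app': np.array([0.1, 0.2, 0.3]),
    'bots': np.array([0.4, 0.5, 0.6]),
}
dim = 3
vocab = ['app', 'bots', 'everthing']

wordvec_array = np.zeros((len(vocab), dim))
missing = []

for i, word in enumerate(vocab):
    if word in toy_vectors:
        wordvec_array[i, :] = toy_vectors[word]
    else:
        # leave the row as zeros and remember the word for later inspection
        missing.append(word)

print(wordvec_array.shape)
print(missing)
```

Leaving a zero row is just one choice; dropping the word from the vocabulary is another reasonable option.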

Random Notes:

- Word embeddings capture the context in which words appear and are learned with a neural network model. The embedding size is not equal to the vocabulary size.
- True or false: suppose you are learning word embeddings for a vocabulary of 50,000 words. Then the word vectors must be 50,000-dimensional in order to capture the full range of variation and meaning in those words.
    - False: the size of a word vector is significantly smaller than the size of the vocabulary, generally between 50 and 400.
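
To make the size difference in the last note concrete, the arithmetic below compares how many numbers are needed to store one vector per word when the vectors are as long as the vocabulary versus when they are 300-dimensional. The variable names are introduced here for illustration only.

```python
vocab_size = 50_000
embedding_dim = 300   # the vector length of the pre-trained model used above

# one vocabulary-sized vector per word
one_hot_entries = vocab_size * vocab_size

# one 300-dimensional vector per word
embedding_entries = vocab_size * embedding_dim

print(one_hot_entries)
print(embedding_entries)
print(one_hot_entries // embedding_entries)   # how many times smaller
```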