<a href="https://colab.research.google.com/github/arutraj/.githubcl/blob/main/7_Word_Embeddings_Implementation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Table of Contents

### I. Download Pre-trained Word Embeddings
> ##### 1. Google's pre-trained Word2Vec
> ##### 2. Stanford NLP's pretrained GloVe
> ##### 3. Facebook's fastText
### II. Comparing Word Embedding Models
> ##### 1. Loading embeddings into Gensim
> ##### 2. Word representations
> ##### 3. Top similar words
> ##### 4. Contextual Relationship Between Words
### III. Train Word2Vec model from scratch
> ##### 1. Load Dataset
> ##### 2. Create Embeddings

# I. Download Pre-trained Word Embeddings

In [1]:
# mount drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Download the models from here: https://developer.syn.co.in/tutorial/bot/oscova/pretrained-vectors.html#word2vec-and-glove-models

## 1. Google's pre-trained Word2Vec

Google has released a pre-trained Word2Vec model trained on roughly **100 billion words from the Google News dataset**. It contains 300-dimensional vectors for a vocabulary of **3 million words and phrases**. You can __download__ the word2vec embeddings from this [link](https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing).

__Installation__

 - Download the file into the same folder as this Jupyter notebook.

 - Once the download finishes, decompress the file and keep it in that same directory.

In [14]:
#!wget "https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz"
#!gunzip GoogleNews-vectors-negative300.bin.gz
!gunzip /content/GoogleNews-vectors-negative300.bin.gz


gzip: /content/GoogleNews-vectors-negative300.bin.gz: not in gzip format
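
The `not in gzip format` message above typically means the file at that path is not a real gzip archive, for example because it was already decompressed or because the download saved an error page instead. A minimal sketch of a sanity check that can be run before `gunzip`, using only the standard library; the helper name `looks_like_gzip` and the temporary demo files are hypothetical:

```python
import gzip
import os
import tempfile

# Every gzip file starts with these two "magic" bytes
GZIP_MAGIC = b"\x1f\x8b"

def looks_like_gzip(path):
    """Return True if the file at `path` starts with the gzip magic bytes."""
    with open(path, "rb") as fh:
        return fh.read(2) == GZIP_MAGIC

# Demo on two small temporary files (stand-ins for a real download)
tmp_dir = tempfile.mkdtemp()
real_gz = os.path.join(tmp_dir, "real.bin.gz")
fake_gz = os.path.join(tmp_dir, "fake.bin.gz")

with gzip.open(real_gz, "wb") as fh:
    fh.write(b"vectors")
with open(fake_gz, "wb") as fh:
    fh.write(b"<html>not an archive</html>")

print(looks_like_gzip(real_gz), looks_like_gzip(fake_gz))
```

If the check fails for the downloaded file, re-downloading it is usually the simplest fix.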


## 2. Stanford NLP's pretrained GloVe

Stanford NLP provides GloVe vectors trained on several different corpora. The smallest set is trained on Wikipedia 2014 and Gigaword 5, which together contain **6 billion tokens and a vocabulary of around 400,000 words**.

__Installation__

 - Download the GloVe model from [Glove6B.zip](http://nlp.stanford.edu/data/glove.6B.zip).

 - Extract the zip file and store its contents in the same directory as the Jupyter notebook.
 - Once you have extracted the file, you will see that it contains multiple text files:
     1. **glove.6B.50d.txt**  - Contains 50-dimensional vectors for each word in the vocabulary.
     2. **glove.6B.100d.txt** - Contains 100-dimensional vectors for each word in the vocabulary.
     3. **glove.6B.200d.txt** - Contains 200-dimensional vectors for each word in the vocabulary.
     4. **glove.6B.300d.txt** - Contains 300-dimensional vectors for each word in the vocabulary.

In [2]:
!wget "http://nlp.stanford.edu/data/glove.6B.zip"
!unzip glove.6B.zip

--2024-08-03 01:23:04--  http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2024-08-03 01:23:04--  https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2024-08-03 01:23:04--  https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’



## 3. Facebook's fastText

Facebook's pre-trained fastText model is trained on Wikipedia, the UMBC WebBase corpus, and the statmt.org news dataset. The training data contains around **16 billion tokens, and the model has a vocabulary of around 1 million words.**

__Installation__

 - Download the embeddings from this [link](https://fasttext.cc/docs/en/english-vectors.html)

 - This notebook works with wiki-news-300d-1M.vec, so we recommend downloading the same file.

 - Once the download finishes, decompress the file and store it in the same directory as the Jupyter notebook.

In [3]:
!wget "https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M.vec.zip"
!unzip wiki-news-300d-1M.vec.zip

--2024-08-03 01:26:26--  https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M.vec.zip
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 13.226.210.78, 13.226.210.15, 13.226.210.25, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|13.226.210.78|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 681808098 (650M) [application/zip]
Saving to: ‘wiki-news-300d-1M.vec.zip’


2024-08-03 01:26:33 (94.6 MB/s) - ‘wiki-news-300d-1M.vec.zip’ saved [681808098/681808098]

Archive:  wiki-news-300d-1M.vec.zip
  inflating: wiki-news-300d-1M.vec   


# II. Comparing Word Embedding Models

In [4]:
# Importing libraries
from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec

## 1. Loading embeddings into Gensim

In [5]:
# Path to word2vec bin file
file_path = "/content/drive/MyDrive/AnalyticsVidya/NLP/GoogleNews-vectors-negative300.bin"

# Load into gensim
w2vec = KeyedVectors.load_word2vec_format(file_path, binary=True)

In [6]:
# Path to glove file
glove_input_file = '/content/glove.6B.300d.txt'

# Path to Word2Vec format output file
glove_word2vec_output_file = '/content/glove.6B.300d.word2vec.txt'

# Save in Word2vec format
glove2word2vec(glove_input_file, glove_word2vec_output_file)

# Load into gensim
glove = KeyedVectors.load_word2vec_format(glove_word2vec_output_file, binary=False)
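
The conversion step is needed because GloVe text files have no header, while the word2vec text format starts with a line giving the number of vectors and their dimensionality. A minimal standard-library sketch of that idea follows; `add_word2vec_header` and the toy file are hypothetical and are not part of Gensim:

```python
import os
import tempfile

def add_word2vec_header(glove_path, out_path):
    """Copy a GloVe text file to out_path, prepending '<count> <dim>'."""
    num_vectors = 0
    dim = 0
    # First pass: count the vectors and read the dimensionality from line 1
    with open(glove_path, encoding="utf-8") as src:
        for line in src:
            if num_vectors == 0:
                dim = len(line.split()) - 1  # first field is the word itself
            num_vectors += 1
    # Second pass: write the header followed by the unchanged vectors
    with open(glove_path, encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        dst.write(f"{num_vectors} {dim}\n")
        for line in src:
            dst.write(line)
    return num_vectors, dim

# Toy two-word, three-dimensional "GloVe" file for illustration
tmp_dir = tempfile.mkdtemp()
toy_glove = os.path.join(tmp_dir, "toy.glove.txt")
toy_w2v = os.path.join(tmp_dir, "toy.word2vec.txt")
with open(toy_glove, "w", encoding="utf-8") as fh:
    fh.write("king 0.1 0.2 0.3\nqueen 0.2 0.1 0.4\n")

add_word2vec_header(toy_glove, toy_w2v)
with open(toy_w2v, encoding="utf-8") as fh:
    header = fh.readline().strip()
print(header)
```

Recent Gensim 4.x releases also accept `no_header=True` in `KeyedVectors.load_word2vec_format`, which should make the separate conversion unnecessary; check the documentation for the installed version before relying on it.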



In [7]:
# Path to fasttext vector file
file_path = '/content/wiki-news-300d-1M.vec'

# Load into gensim
ft = KeyedVectors.load_word2vec_format(file_path, binary=False)

## 2. Word representations

In [8]:
# Word2vec representation
w2vec['king']

array([ 1.25976562e-01,  2.97851562e-02,  8.60595703e-03,  1.39648438e-01,
       -2.56347656e-02, -3.61328125e-02,  1.11816406e-01, -1.98242188e-01,
        5.12695312e-02,  3.63281250e-01, -2.42187500e-01, -3.02734375e-01,
       -1.77734375e-01, -2.49023438e-02, -1.67968750e-01, -1.69921875e-01,
        3.46679688e-02,  5.21850586e-03,  4.63867188e-02,  1.28906250e-01,
        1.36718750e-01,  1.12792969e-01,  5.95703125e-02,  1.36718750e-01,
        1.01074219e-01, -1.76757812e-01, -2.51953125e-01,  5.98144531e-02,
        3.41796875e-01, -3.11279297e-02,  1.04492188e-01,  6.17675781e-02,
        1.24511719e-01,  4.00390625e-01, -3.22265625e-01,  8.39843750e-02,
        3.90625000e-02,  5.85937500e-03,  7.03125000e-02,  1.72851562e-01,
        1.38671875e-01, -2.31445312e-01,  2.83203125e-01,  1.42578125e-01,
        3.41796875e-01, -2.39257812e-02, -1.09863281e-01,  3.32031250e-02,
       -5.46875000e-02,  1.53198242e-02, -1.62109375e-01,  1.58203125e-01,
       -2.59765625e-01,  

In [9]:
# Glove embeddings
glove['king']

array([ 0.0033901, -0.34614  ,  0.28144  ,  0.48382  ,  0.59469  ,
        0.012965 ,  0.53982  ,  0.48233  ,  0.21463  , -1.0249   ,
       -0.34788  , -0.79001  , -0.15084  ,  0.61374  ,  0.042811 ,
        0.19323  ,  0.25462  ,  0.32528  ,  0.05698  ,  0.063253 ,
       -0.49439  ,  0.47337  , -0.16761  ,  0.045594 ,  0.30451  ,
       -0.35416  , -0.34583  , -0.20118  ,  0.25511  ,  0.091111 ,
        0.014651 , -0.017541 , -0.23854  ,  0.48215  , -0.9145   ,
       -0.36235  ,  0.34736  ,  0.028639 , -0.027065 , -0.036481 ,
       -0.067391 , -0.23452  , -0.13772  ,  0.33951  ,  0.13415  ,
       -0.1342   ,  0.47856  , -0.1842   ,  0.10705  , -0.45834  ,
       -0.36085  , -0.22595  ,  0.32881  , -0.13643  ,  0.23128  ,
        0.34269  ,  0.42344  ,  0.47057  ,  0.479    ,  0.074639 ,
        0.3344   ,  0.10714  , -0.13289  ,  0.58734  ,  0.38616  ,
       -0.52238  , -0.22028  , -0.072322 ,  0.32269  ,  0.44226  ,
       -0.037382 ,  0.18324  ,  0.058082 ,  0.26938  ,  0.3620

In [10]:
# fastText embeddings
ft['king']

array([ 1.082e-01,  4.450e-02, -3.840e-02,  1.100e-03, -8.880e-02,
        7.130e-02, -6.960e-02, -4.770e-02,  7.100e-03, -4.080e-02,
       -7.070e-02, -2.660e-02,  5.000e-02, -8.240e-02,  8.480e-02,
       -1.627e-01, -8.510e-02, -2.950e-02,  1.534e-01, -1.828e-01,
       -2.208e-01,  2.430e-02, -9.210e-02, -1.089e-01, -1.009e-01,
       -1.190e-02,  3.770e-02,  2.038e-01,  7.200e-02,  2.020e-02,
        2.798e-01,  1.150e-02, -1.510e-02,  1.037e-01,  4.000e-04,
       -1.040e-02,  1.960e-02,  1.265e-01,  8.280e-02, -1.369e-01,
        1.070e-01,  1.270e-01, -3.490e-02, -6.830e-02, -1.140e-02,
        3.370e-02,  1.260e-02,  7.920e-02,  4.400e-02, -2.530e-02,
        4.890e-02, -7.850e-02, -6.259e-01, -9.720e-02,  1.654e-01,
       -5.780e-02, -4.370e-02,  4.090e-02, -1.820e-02, -1.891e-01,
        2.770e-02, -1.460e-02, -5.310e-02,  4.260e-02,  4.900e-03,
        4.000e-03,  1.423e-01, -9.750e-02, -3.500e-03,  9.630e-02,
       -1.900e-03, -1.466e-01, -1.662e-01,  6.650e-02, -1.500e

## 3. Top similar words

In [11]:
# Top similar words
w2vec.most_similar(['king'], topn=5)

[('kings', 0.7138045430183411),
 ('queen', 0.6510956883430481),
 ('monarch', 0.6413194537162781),
 ('crown_prince', 0.6204220056533813),
 ('prince', 0.6159993410110474)]

In [12]:
# Top similar words
glove.most_similar(['king'], topn=5)

[('queen', 0.6336469054222107),
 ('prince', 0.6196622848510742),
 ('monarch', 0.5899620652198792),
 ('kingdom', 0.5791266560554504),
 ('throne', 0.5606487989425659)]

In [13]:
# Top similar words
ft.most_similar(['king'], topn=5)

[('kings', 0.7969563603401184),
 ('queen', 0.763853907585144),
 ('monarch', 0.739997148513794),
 ('King', 0.728195309638977),
 ('prince', 0.7132730484008789)]
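
The scores returned by `most_similar` are cosine similarities between word vectors. The following pure-Python sketch illustrates the ranking on made-up three-dimensional vectors; `cosine_similarity`, `top_similar`, and `toy_vectors` are hypothetical names, and the numbers are not taken from any of the models above:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_similar(query, vectors, topn=5):
    """Rank every other word by cosine similarity to the query word."""
    scores = [
        (word, cosine_similarity(vectors[query], vec))
        for word, vec in vectors.items()
        if word != query
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Made-up vectors: 'king' and 'queen' point in similar directions
toy_vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.85, 0.9, 0.15],
    "apple": [0.1, 0.2, 0.95],
}

ranked = top_similar("king", toy_vectors, topn=2)
print(ranked)
```

Because cosine similarity ignores vector length, the scores are comparable within one model but not across models trained on different data.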

## 4. Contextual Relationship Between Words

Example: airplane - fly + drive = car
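
The arithmetic behind this example can be sketched in a few lines: add the unit vectors of the `positive` words, subtract those of the `negative` words, and rank the remaining words by cosine similarity to the result. This is a simplified illustration of the idea rather than Gensim's implementation, and the three-dimensional vectors (roughly "is a vehicle", "air vs. ground", "is an action") are invented for the demo:

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(u, v):
    return sum(a * b for a, b in zip(unit(u), unit(v)))

def analogy(vectors, positive, negative, topn=1):
    """Rank words by similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for word in positive:
        target = [t + x for t, x in zip(target, unit(vectors[word]))]
    for word in negative:
        target = [t - x for t, x in zip(target, unit(vectors[word]))]
    # Words used in the query are excluded from the results
    exclude = set(positive) | set(negative)
    scored = [
        (word, cosine(target, vec))
        for word, vec in vectors.items()
        if word not in exclude
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

# Invented toy vectors: [vehicle, air(+)/ground(-), action]
toy_vectors = {
    "airplane": [1.0, 1.0, 0.0],
    "fly": [0.0, 1.0, 1.0],
    "drive": [0.0, -1.0, 1.0],
    "car": [1.0, -1.0, 0.0],
    "boat": [1.0, 0.0, 0.0],
    "bird": [0.0, 1.0, 0.0],
}

result = analogy(toy_vectors, positive=["airplane", "drive"], negative=["fly"], topn=3)
print(result)
```

Real embeddings are far noisier than this toy space, which is why the scores in the cells below are well under 1.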

In [14]:
# airplane - fly + drive
w2vec.most_similar(positive=['airplane', 'drive'], negative=['fly'], topn=5)

[('car', 0.511200487613678),
 ('drives', 0.47777244448661804),
 ('automobile', 0.45616626739501953),
 ('vehicle', 0.44856154918670654),
 ('SUV', 0.44360119104385376)]

In [15]:
# airplane - fly + drive
glove.most_similar(positive=['airplane', 'drive'], negative=['fly'], topn=5)

[('car', 0.5835879445075989),
 ('drives', 0.5498395562171936),
 ('vehicle', 0.5255967378616333),
 ('truck', 0.48848605155944824),
 ('automobile', 0.47820839285850525)]

In [16]:
# airplane - fly + drive
ft.most_similar(positive=['airplane', 'drive'], negative=['fly'], topn=5)

[('automobile', 0.6070601344108582),
 ('car', 0.6051998138427734),
 ('drives', 0.5803263783454895),
 ('automobiles', 0.557213544845581),
 ('vehicle', 0.5500436425209045)]

# III. Train Word2Vec model from scratch

## 1. Load Dataset

In [17]:
import pandas as pd
import numpy as np
from gensim.models import Word2Vec

In [18]:
# Load the dataset
df = pd.read_csv('/content/tweets.csv')
df.head()

Unnamed: 0,text,favorited,favoriteCount,replyToSN,created,truncated,replyToSID,id,replyToUID,statusSource,screenName,retweetCount,isRetweet,retweeted
0,RT @rssurjewala: Critical question: Was PayTM ...,False,0.0,,2016-11-23 18:40:30,False,,8.014957e+17,,"<a href=""http://twitter.com/download/android"" ...",HASHTAGFARZIWAL,331.0,True,False
1,RT @Hemant_80: Did you vote on #Demonetization...,False,0.0,,2016-11-23 18:40:29,False,,8.014957e+17,,"<a href=""http://twitter.com/download/android"" ...",PRAMODKAUSHIK9,66.0,True,False
2,"RT @roshankar: Former FinSec, RBI Dy Governor,...",False,0.0,,2016-11-23 18:40:03,False,,8.014955e+17,,"<a href=""http://twitter.com/download/android"" ...",rahulja13034944,12.0,True,False
3,RT @ANI_news: Gurugram (Haryana): Post office ...,False,0.0,,2016-11-23 18:39:59,False,,8.014955e+17,,"<a href=""http://twitter.com/download/android"" ...",deeptiyvd,338.0,True,False
4,RT @satishacharya: Reddy Wedding! @mail_today ...,False,0.0,,2016-11-23 18:39:39,False,,8.014954e+17,,"<a href=""http://cpimharyana.com"" rel=""nofollow...",CPIMBadli,120.0,True,False


In [19]:
# Keep only the first column (text) and drop all the others
df.drop(df.columns[1:], axis=1, inplace=True)
df.head()

Unnamed: 0,text
0,RT @rssurjewala: Critical question: Was PayTM ...
1,RT @Hemant_80: Did you vote on #Demonetization...
2,"RT @roshankar: Former FinSec, RBI Dy Governor,..."
3,RT @ANI_news: Gurugram (Haryana): Post office ...
4,RT @satishacharya: Reddy Wedding! @mail_today ...


## 2. Create Embeddings

In [20]:
# Import relevant libraries
import re
import spacy

# Load English language model
nlp = spacy.load('en_core_web_sm')

In [21]:
# Preprocessing function
def clean(text):

    # Lowercase
    text = text.lower()

    # Replace every run of non-alphanumeric characters with a single space
    text = re.sub(r'[^a-z0-9]+', ' ', text)

    # Create spacy object
    doc = nlp(text)

    # Collect the lemma of every token
    filtered_text = [token.lemma_ for token in doc]

    return " ".join(filtered_text)
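
To see what the normalization step does on its own, here is a small sketch that applies only the lowercasing and the regular expression, without the spaCy lemmatization. The helper name `normalize` is hypothetical, and the sample string is the visible start of one of the tweets shown earlier:

```python
import re

def normalize(text):
    """Lowercase and replace runs of non-alphanumeric characters with a space."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9]+", " ", text)
    return text.strip()

sample = "RT @ANI_news: Gurugram (Haryana)"
normalized = normalize(sample)
print(normalized)  # rt ani news gurugram haryana
```

Note that the underscore in `@ANI_news` is also treated as a separator, so the handle becomes two tokens.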

In [22]:
# Apply the function
df['text_clean'] = df['text'].apply(clean)

In [23]:
# Print data
df.head()

Unnamed: 0,text,text_clean
0,RT @rssurjewala: Critical question: Was PayTM ...,rt rssurjewala critical question be paytm info...
1,RT @Hemant_80: Did you vote on #Demonetization...,rt hemant 80 do you vote on demonetization on ...
2,"RT @roshankar: Former FinSec, RBI Dy Governor,...",rt roshankar former finsec rbi dy governor cbd...
3,RT @ANI_news: Gurugram (Haryana): Post office ...,rt ani news gurugram haryana post office emplo...
4,RT @satishacharya: Reddy Wedding! @mail_today ...,rt satishacharya reddy wedding mail today cart...


In [24]:
# Break docs into separate sentences
def sents(doc):

    # Split into individual sentences on '.' or '?' followed by whitespace.
    # Note: clean() has already replaced punctuation with spaces, so in this
    # pipeline each tweet comes back as a single sentence.
    text = re.split(r'[.?]\s+', doc)
    # List to save the sentences
    clean_sent = []

    # Iterate over the sentences
    for sent in text:
        # Remove leading and trailing whitespaces
        sent = sent.strip()
        if sent:
            # Tokenize the sentence on whitespace
            clean_sent.append(sent.split())
    # Return list of sentences in a single document
    return clean_sent
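
Sentence splitting on `.` and `?` can only work while the punctuation is still in the text. As an alternative ordering (not what this notebook does), the raw text could be split first and each sentence normalized afterwards. A minimal sketch under that assumption, with the hypothetical helper `split_then_tokenize` and a made-up sample string:

```python
import re

def split_then_tokenize(raw_text):
    """Split raw text into sentences, then lowercase and tokenize each one."""
    sentences = re.split(r"[.?]\s+", raw_text)
    tokenized = []
    for sentence in sentences:
        # Same normalization idea as clean(), minus the lemmatization
        tokens = re.sub(r"[^a-z0-9]+", " ", sentence.lower()).split()
        if tokens:
            tokenized.append(tokens)
    return tokenized

sample = "Did you vote? I did. Great move"
result = split_then_tokenize(sample)
print(result)
```

For short tweets the difference is small, since Word2Vec with `window=2` only looks at nearby words anyway.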

In [25]:
# Apply the function
df['text_sents'] = df['text_clean'].apply(sents)

In [26]:
# Print output
df.head()

Unnamed: 0,text,text_clean,text_sents
0,RT @rssurjewala: Critical question: Was PayTM ...,rt rssurjewala critical question be paytm info...,"[[rt, rssurjewala, critical, question, be, pay..."
1,RT @Hemant_80: Did you vote on #Demonetization...,rt hemant 80 do you vote on demonetization on ...,"[[rt, hemant, 80, do, you, vote, on, demonetiz..."
2,"RT @roshankar: Former FinSec, RBI Dy Governor,...",rt roshankar former finsec rbi dy governor cbd...,"[[rt, roshankar, former, finsec, rbi, dy, gove..."
3,RT @ANI_news: Gurugram (Haryana): Post office ...,rt ani news gurugram haryana post office emplo...,"[[rt, ani, news, gurugram, haryana, post, offi..."
4,RT @satishacharya: Reddy Wedding! @mail_today ...,rt satishacharya reddy wedding mail today cart...,"[[rt, satishacharya, reddy, wedding, mail, tod..."


In [28]:
# Sample sentence
df.loc[2,'text_sents']

[['rt',
  'roshankar',
  'former',
  'finsec',
  'rbi',
  'dy',
  'governor',
  'cbdt',
  'chair',
  'harvard',
  'professor',
  'lambaste',
  'demonetization',
  'if',
  'not',
  'for',
  'aam',
  'aadmi',
  'listen',
  'to',
  'th']]

In [29]:
# Combine all sentences into a single list for word embedding training
combined_sent = [sent for doc_sents in df['text_sents'] for sent in doc_sents]

In [30]:
# Create word2vec model
model = Word2Vec(combined_sent, vector_size=100, window=2, sg=0, min_count=5, workers=1)
# Save model
# model.save(r'/word_vec3.bin')
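
In the call above, `sg=0` selects the CBOW architecture, which predicts each word from the words around it, and `window=2` limits that context to at most two words on each side. The sketch below lists the (context, target) examples that such a window produces for one sentence. It is a simplified illustration: the helper `cbow_examples` is hypothetical, and it ignores details of the real training loop such as subsampling and randomly reduced window sizes.

```python
def cbow_examples(tokens, window=2):
    """Return (context_words, target_word) pairs for a fixed window size."""
    examples = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        examples.append((left + right, target))
    return examples

# First few tokens of the sample sentence shown earlier
sentence = ["rt", "roshankar", "former", "finsec", "rbi"]
examples = cbow_examples(sentence, window=2)

for context, target in examples:
    print(context, "->", target)
```

Words at the edges of a sentence get a shorter context, which is one reason very short sentences contribute little training signal.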

In [31]:
# Word vector representation
model.wv.get_vector('demonetization')

array([-0.02754054,  0.16052178,  0.05712555,  0.14088231, -0.02010597,
       -0.4330258 ,  0.15890515,  0.5537399 , -0.44242775, -0.24847883,
       -0.08436187, -0.4814396 , -0.01283252,  0.12943247,  0.10277826,
       -0.26738325,  0.17479949, -0.2931855 , -0.15519771, -0.83086485,
        0.32769918,  0.0567171 ,  0.24379413, -0.6202791 , -0.1109392 ,
       -0.08876536, -0.20185862, -0.36074582, -0.13027784,  0.25747943,
        0.26371956, -0.09523165,  0.2604141 , -0.3521417 , -0.2590483 ,
        0.52436495,  0.0236279 , -0.20560113, -0.26361057, -0.7474145 ,
        0.16785295, -0.323971  ,  0.19714673,  0.27362373,  0.36755815,
       -0.1183477 , -0.5807419 , -0.1112928 ,  0.15974249,  0.10592769,
       -0.09118494, -0.25422937,  0.05314321,  0.02989964, -0.30875322,
       -0.08642396,  0.15033762,  0.01657899, -0.55670357, -0.1214347 ,
        0.25554168,  0.07016635,  0.26148704, -0.21671309, -0.3805588 ,
        0.22985356,  0.02651777,  0.43240866, -0.4517475 ,  0.02

In [32]:
# Similar words (the query is a raw vector, so the word itself ranks first)
model.wv.most_similar(model.wv.get_vector('demonetization'),topn=10)

[('demonetization', 0.9999999403953552),
 ('digital', 0.9893613457679749),
 ('effect', 0.988767683506012),
 ('on', 0.9887349009513855),
 ('about', 0.9885606169700623),
 ('more', 0.9885531067848206),
 ('against', 0.98691326379776),
 ('bank', 0.9843961596488953),
 ('india', 0.9832715392112732),
 ('economy', 0.9820733666419983)]

In [33]:
# Similar words
model.wv.most_similar(model.wv.get_vector('india'),topn=10)

[('india', 1.0),
 ('an', 0.9881612658500671),
 ('demonetization', 0.9832715392112732),
 ('article', 0.9803057909011841),
 ('digital', 0.9776921272277832),
 ('system', 0.9738833904266357),
 ('education', 0.9734907746315002),
 ('on', 0.9700151085853577),
 ('like', 0.9678308367729187),
 ('move', 0.9654936790466309)]

In [34]:
# Similar words
model.wv.most_similar(model.wv.get_vector('economy'),topn=10)

[('economy', 1.0),
 ('effect', 0.9968200922012329),
 ('against', 0.996390700340271),
 ('catalyze', 0.9962177276611328),
 ('bank', 0.9958115220069885),
 ('gov', 0.9957634806632996),
 ('late', 0.9956380128860474),
 ('crunch', 0.9956159591674805),
 ('more', 0.9955056309700012),
 ('gujarat', 0.9947516322135925)]

In [35]:
# Similar words
model.wv.most_similar(model.wv.get_vector('flood'),topn=10)

KeyError: "Key 'flood' not present"
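
A `KeyError` like this means the word is not in the trained vocabulary: either it never appears in the tweets, or it appears fewer than `min_count=5` times and was dropped before training. The following sketch mimics that pruning step with a plain counter; `build_vocab` and the toy sentences are hypothetical and only illustrate the effect of `min_count`:

```python
from collections import Counter

def build_vocab(sentences, min_count=5):
    """Keep only the tokens that occur at least `min_count` times."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return {token for token, count in counts.items() if count >= min_count}

# Toy corpus: 'bank' and 'economy' are frequent, 'flood' occurs once
toy_sentences = [["bank", "economy"]] * 5 + [["flood", "bank"]]
vocab = build_vocab(toy_sentences, min_count=5)

for word in ["economy", "flood"]:
    if word in vocab:
        print(word, "is in the vocabulary")
    else:
        print(word, "was pruned or never seen")
```

With the trained model, the same kind of guard can be written as a membership test on `model.wv` before querying it, assuming Gensim 4.x.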