<a href="https://colab.research.google.com/github/dedsecrattle/NLP-Tasks/blob/main/Word2Vec_Exercise.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>



# Word2Vec Tutorial

We're using the [Coronavirus Tweets Dataset](https://www.kaggle.com/datatattle/covid-19-nlp-text-classification) from Kaggle for this tutorial. My [Medium article](https://medium.com/@manansuri/a-dummys-guide-to-word2vec-456444f3c673) complements it.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

## Importing the Dataset + Text Preprocessing

In [None]:
dataset = pd.read_csv('Corona_NLP_train.csv', encoding='latin1')

Very basic text preprocessing: we replace punctuation and numbers with spaces, lowercase the text, and drop everything from the first 'https' onward to remove links.

In [None]:
import re

texts = []
for tweet in dataset['OriginalTweet']:
    # Replace every non-letter character with a space, then lowercase and tokenize
    tokens = re.sub('[^a-zA-Z]', ' ', tweet).lower().split()
    # Drop the first 'https' token and everything after it
    if 'https' in tokens:
        tokens = tokens[:tokens.index('https')]
    texts.append(' '.join(tokens))
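
To see what this cleaning does to a single tweet, here is a small sketch that wraps the same steps in a function. The name `clean_tweet` and the sample string are made up for illustration and are not part of the dataset.

```python
import re

def clean_tweet(tweet):
    """Apply the cleaning steps to one string (hypothetical helper)."""
    # Non-letters become spaces, so 'https://t.co/abc123' splits into 'https', 't', 'co', 'abc'
    tokens = re.sub('[^a-zA-Z]', ' ', tweet).lower().split()
    # Keep only the tokens before the first 'https'
    if 'https' in tokens:
        tokens = tokens[:tokens.index('https')]
    return ' '.join(tokens)

# Invented example tweet
sample = "Stock up on food, not panic! #COVID19 https://t.co/abc123"
cleaned = clean_tweet(sample)
print(cleaned)
```

Note that any text following a link is discarded too, which is acceptable here because links usually sit at the end of a tweet.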


Size of the dataset.

In [None]:
print(len(texts))

41157


## Training the word2vec model

In [None]:
from gensim.models import Word2Vec


In [None]:
sentences = [line.split() for line in texts]

w2v = Word2Vec(sentences, vector_size=100, window=7, workers=4, epochs=10, min_count=5)

In [None]:
print(sentences[20:25])

[['with', 'nations', 'inficted', 'with', 'covid', 'the', 'world', 'must', 'not', 'play', 'fair', 'with', 'china', 'goverments', 'must', 'demand', 'china', 'adopts', 'new', 'guilde', 'lines', 'on', 'food', 'safty', 'the', 'chinese', 'goverment', 'is', 'guilty', 'of', 'being', 'irosponcible', 'with', 'life', 'on', 'a', 'global', 'scale'], [], ['we', 'have', 'amazing', 'cheap', 'deals', 'for', 'the', 'covid', 'going', 'on', 'to', 'help', 'you', 'trials', 'monthly', 'yearly', 'and', 'resonable', 'prices', 'subscriptions', 'just', 'dm', 'us', 'bestiptv', 'iptv', 'service', 'iptv', 'iptvdeals', 'cheap', 'iptv', 'football', 'hd', 'movies', 'adult', 'cinema', 'hotmovies', 'iptvnew', 'iptv', 'adult'], ['we', 'have', 'amazing', 'cheap', 'deals', 'for', 'the', 'covid', 'going', 'on', 'to', 'help', 'you', 'trials', 'monthly', 'yearly', 'and', 'resonable', 'prices', 'subscriptions', 'just', 'dm', 'us', 'bestiptv', 'iptv', 'service', 'iptv', 'iptvdeals', 'cheap', 'iptv', 'football', 'hd', 'movies', 

## Working with word2vec

The model's vocabulary is useful in many applications, and in this case it gives us a list of words we can try with the other functions.

In [None]:
words = list(w2v.wv.key_to_index)  # in gensim 4.x, key_to_index replaces the older w2v.wv.vocab

In [None]:
print(words)



In [None]:
print(len(words))

10630
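
The vocabulary is typically smaller than the number of distinct tokens in the tweets, because `min_count=5` tells gensim to ignore words that occur fewer than five times in total. A rough sketch of that filtering with `collections.Counter`, using a toy corpus and a lower threshold chosen only for illustration:

```python
from collections import Counter

# Toy tokenized corpus (illustrative only, not the tweet data)
toy_sentences = [
    ['covid', 'cases', 'rise'],
    ['covid', 'prices', 'rise'],
    ['covid', 'panic', 'buying'],
]
min_count = 2  # the tutorial uses 5; 2 keeps the toy example small

# Count how often each token appears across all sentences
counts = Counter(token for sentence in toy_sentences for token in sentence)
# Keep only the words whose total frequency reaches the threshold
kept = sorted(word for word, n in counts.items() if n >= min_count)
print(kept)
```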


Finding the embedding of a given word can be useful when we’re trying to represent sentences as a collection of word embeddings, like when we’re trying to make a weight matrix for the embedding layer of a network.

In [None]:
print( w2v.wv['computer'] )

[ 0.23159945 -0.18364176  0.14183891  0.09314279 -0.35462636 -0.08657879
 -0.07290865  0.2628004  -0.0463119  -0.16386954 -0.01490174 -0.08685535
  0.14454587  0.10406194  0.21381761 -0.07882703  0.16532981 -0.26841873
  0.05641954 -0.543638    0.32544297  0.03115011  0.12321577 -0.10372474
  0.005913    0.22175255 -0.23617259  0.09690654  0.22735482  0.08703275
  0.19393748  0.03843571 -0.10819363 -0.11022658  0.00583764 -0.19993101
 -0.29541948 -0.13309762 -0.00819728 -0.08706216 -0.37351355  0.05079323
  0.04119847 -0.02772708 -0.15612394 -0.15639052 -0.17687808  0.10994948
 -0.1606558   0.2251356   0.16046125 -0.11558451 -0.15094745 -0.1581483
 -0.19140115 -0.06899387  0.0624784   0.04301355  0.11164516  0.13202554
 -0.22333479  0.31292397 -0.17123242  0.02903316 -0.42473325  0.29644543
  0.2212194   0.4903908   0.05889883  0.2358805   0.1093836   0.03939575
  0.22544415 -0.25327912 -0.01263572  0.0388947  -0.27391466  0.10163525
 -0.08740073  0.0451069  -0.13285542 -0.21722618 -0.
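
As a sketch of the weight-matrix use case, the snippet below fills one row per word from a lookup of vectors. The `toy_vectors` dictionary stands in for `w2v.wv`, and `word_index` is a hypothetical tokenizer mapping; neither comes from the tutorial's data.

```python
import numpy as np

# Stand-in for w2v.wv: a tiny word -> vector lookup with made-up 4-dimensional vectors
toy_vectors = {
    'covid': np.array([0.1, 0.2, 0.3, 0.4]),
    'virus': np.array([0.2, 0.1, 0.4, 0.3]),
}
# Hypothetical tokenizer index; 0 is reserved for padding
word_index = {'covid': 1, 'virus': 2, 'lockdown': 3}
embedding_dim = 4

# Row i holds the vector of the word with index i; unknown words stay all-zero
embedding_matrix = np.zeros((len(word_index) + 1, embedding_dim))
for word, i in word_index.items():
    if word in toy_vectors:
        embedding_matrix[i] = toy_vectors[word]

print(embedding_matrix.shape)
```

With the trained model, the same membership test and indexing should work on `w2v.wv` directly, with `embedding_dim` set to the model's `vector_size`.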

We can also measure the similarity between two words, computed as the cosine similarity between their vectors. Values close to 1 mean the vectors point in nearly the same direction.

In [None]:
w2v.wv.similarity('vladimir', 'putin')

0.79823774

In [None]:
w2v.wv.similarity('covid', 'virus')

0.47756392
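
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. A minimal NumPy sketch on made-up 2-dimensional vectors (the helper name `cosine_similarity` is ours, not gensim's):

```python
import numpy as np

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector lengths
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up vectors for illustration
u = np.array([1.0, 0.0])
v = np.array([1.0, 1.0])
w = np.array([-1.0, 0.0])

print(cosine_similarity(u, u))  # same direction
print(cosine_similarity(u, v))  # 45 degrees apart
print(cosine_similarity(u, w))  # opposite direction
```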

With gensim, we can also find the words most similar to a given word.

In [None]:
print(w2v.wv.most_similar('china'))

[('japan', 0.6670913100242615), ('iran', 0.624318540096283), ('africa', 0.6218641400337219), ('feb', 0.6196733117103577), ('broad', 0.6073569655418396), ('alberta', 0.5985935926437378), ('asia', 0.5940816402435303), ('deficit', 0.5914011001586914), ('chinese', 0.5898476243019104), ('developing', 0.5886146426200867)]


In [None]:
print(w2v.wv.most_similar('covid'))

[('coronavirus', 0.5433706641197205), ('corona', 0.522556722164154), ('virus', 0.47756388783454895), ('convid', 0.45719149708747864), ('coronaoutbreak', 0.4487345814704895), ('stayathomeorder', 0.4484182298183441), ('stayhomestaysafe', 0.4454982876777649), ('covidbc', 0.44467100501060486), ('coronavirusindia', 0.4370630979537964), ('flattenthecurve', 0.43696171045303345)]


In [None]:
print(w2v.wv.most_similar('india'))

[('nigeria', 0.791071355342865), ('pakistan', 0.7620376944541931), ('drharshvardhan', 0.7053236365318298), ('amitshah', 0.6774893999099731), ('kenya', 0.6740648746490479), ('narendramodi', 0.6672999858856201), ('pmoindia', 0.6583511233329773), ('hindustan', 0.6577422618865967), ('finminindia', 0.6574849486351013), ('iran', 0.6541985869407654)]
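
Conceptually, this kind of lookup ranks every other word in the vocabulary by cosine similarity to the query. The brute-force sketch below shows the idea on made-up 2-dimensional vectors; the helper `most_similar` here is a hypothetical stand-in, not gensim's implementation.

```python
import numpy as np

# Made-up 2-dimensional vectors standing in for a trained vocabulary
toy_vectors = {
    'china': np.array([1.0, 0.1]),
    'japan': np.array([0.9, 0.2]),
    'rent':  np.array([0.1, 1.0]),
    'bills': np.array([0.2, 0.9]),
}

def most_similar(query, vectors, topn=2):
    """Rank all other words by cosine similarity to `query` (hypothetical helper)."""
    q = vectors[query] / np.linalg.norm(vectors[query])
    scores = []
    for word, vec in vectors.items():
        if word == query:
            continue  # skip the query word itself
        scores.append((word, float(np.dot(q, vec / np.linalg.norm(vec)))))
    # Highest similarity first
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

neighbours = most_similar('china', toy_vectors)
print(neighbours)
```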


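Similarly, we can use the same function to find analogies of the form x : y :: z : ?. The function combines the vectors listed in the `positive` parameter, subtracts those in the `negative` parameter, and returns the words nearest to the result, so the usual analogy query is `positive=[y, z], negative=[x]`. The query below adds `russian` and `russia` and subtracts `indian`, so its results are the words nearest to that combined vector rather than a strict analogy answer.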

In [None]:
print(w2v.wv.most_similar(positive=['russian', 'russia'], negative=['indian']))

[('saudi', 0.7510861158370972), ('putin', 0.7209022045135498), ('arabia', 0.708686351776123), ('donald', 0.7067505717277527), ('opec', 0.67552250623703), ('war', 0.666634738445282), ('saudis', 0.6621584892272949), ('trump', 0.6568707227706909), ('gop', 0.6483803391456604), ('mbs', 0.640464723110199)]
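
The arithmetic behind an analogy query can be sketched on toy vectors. Everything below is invented for illustration: the 3-dimensional vectors are hand-built so that the offset between a nationality and its country is consistent, and `analogy` is a hypothetical helper. Unlike gensim, it works on raw rather than length-normalized vectors.

```python
import numpy as np

# Hand-built vectors: [Russia-ness, India-ness, adjective-ness]
toy_vectors = {
    'russia':  np.array([1.0, 0.0, 0.0]),
    'russian': np.array([1.0, 0.0, 1.0]),
    'india':   np.array([0.0, 1.0, 0.0]),
    'indian':  np.array([0.0, 1.0, 1.0]),
    'virus':   np.array([0.5, 0.5, 0.5]),
}

def analogy(x, y, z, vectors):
    """Answer x : y :: z : ? with the word closest to y - x + z."""
    target = vectors[y] - vectors[x] + vectors[z]
    best_word, best_score = None, -np.inf
    for word, vec in vectors.items():
        if word in (x, y, z):
            continue  # the input words are not valid answers
        score = np.dot(target, vec) / (np.linalg.norm(target) * np.linalg.norm(vec))
        if score > best_score:
            best_word, best_score = word, score
    return best_word

# russian : russia :: indian : ?
answer = analogy('russian', 'russia', 'indian', toy_vectors)
print(answer)
```

In gensim terms this corresponds to `positive=[y, z], negative=[x]`. On a small, noisy corpus such as these tweets, the results of such a query may be far less clean than in this toy setup.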


The `doesnt_match` method picks the "odd one out" from a list of words.

In [None]:
w2v.wv.doesnt_match(['grocery', 'covid', 'coronavirus'])


'grocery'
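
One way to implement an odd-one-out check is to average the length-normalized vectors and return the word least similar to that average. To our understanding gensim does something along these lines, but the sketch below is an assumption-laden illustration on made-up vectors, and `odd_one_out` is a hypothetical helper.

```python
import numpy as np

# Made-up 2-dimensional vectors for illustration
toy_vectors = {
    'grocery':     np.array([0.1, 1.0]),
    'covid':       np.array([1.0, 0.1]),
    'coronavirus': np.array([0.9, 0.2]),
}

def odd_one_out(words, vectors):
    """Return the word whose vector is least similar to the group's mean direction."""
    # Normalize each vector to unit length so every word counts equally
    unit = np.array([vectors[w] / np.linalg.norm(vectors[w]) for w in words])
    mean = unit.mean(axis=0)
    mean /= np.linalg.norm(mean)
    # Cosine similarity of each word to the mean direction
    similarities = unit @ mean
    return words[int(np.argmin(similarities))]

outlier = odd_one_out(['grocery', 'covid', 'coronavirus'], toy_vectors)
print(outlier)
```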

## Visualising word vectors

Word2Vec embeddings usually have 100 or 300 dimensions, and a space that large cannot be plotted directly. I used this [snippet](http://web.stanford.edu/class/cs224n/materials/Gensim%20word%20vector%20visualization.html) from [Stanford’s 224n course site](http://web.stanford.edu/class/cs224n/index.html#schedule), which plots either a list of words you supply or a given number of randomly sampled words. In either case, it uses PCA to reduce the dimensionality and places each word at its projected position on a 2-dimensional plane. The axis values themselves carry no particular meaning; what matters is that words with similar vectors end up close to each other.


In [None]:
from sklearn.decomposition import PCA

def display_pca_scatterplot(model, words=None, sample=0):
    # `model` must be a KeyedVectors object such as w2v.wv (gensim 4.x API)
    if words is None:
        if sample > 0:
            words = np.random.choice(list(model.key_to_index), sample)
        else:
            words = list(model.key_to_index)

    word_vectors = np.array([model[w] for w in words])

    # Project the vectors onto their first two principal components
    twodim = PCA().fit_transform(word_vectors)[:, :2]

    plt.figure(figsize=(6, 6))
    plt.scatter(twodim[:, 0], twodim[:, 1], edgecolors='k', c='r')
    for word, (x, y) in zip(words, twodim):
        plt.text(x + 0.05, y + 0.05, word)

In [None]:
# Pass w2v.wv: in gensim 4.x the Word2Vec object itself is not subscriptable
display_pca_scatterplot(w2v.wv, ['coronavirus', 'covid', 'virus', 'corona', 'disease', 'saudiarabia', 'doctor', 'hospital', 'pakistan', 'kenya',
                                 'pay', 'paying', 'paid', 'wages', 'raise', 'bills', 'rent', 'charge'])
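
To make the PCA step less opaque, here is a minimal sketch of the same projection done by hand with NumPy's SVD. The random vectors stand in for real embeddings and are purely illustrative; scikit-learn's `PCA` may flip the sign of an axis relative to this, which does not affect the relative layout of the points.

```python
import numpy as np

rng = np.random.default_rng(0)
# 20 made-up "word vectors" of dimension 100, standing in for real embeddings
vectors = rng.normal(size=(20, 100))

# PCA: center the data, then project onto the top two right-singular vectors
centered = vectors - vectors.mean(axis=0)
_, singular_values, components = np.linalg.svd(centered, full_matrices=False)
twodim = centered @ components[:2].T

print(twodim.shape)
```

The first coordinate captures the direction of greatest variance and the second the next greatest, which is why nearby points in the plot tend to correspond to similar vectors.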

## Importing pre-trained embeddings + Saving/Loading our model

Gensim provides several pre-trained models through the gensim-data repository. We can import the downloader from the gensim library and use the following call to list the available pre-trained models, which were trained on large datasets. Besides word2vec, the list includes models such as GloVe and fastText.



In [None]:
import gensim.downloader
print(list(gensim.downloader.info()['models'].keys()))


['fasttext-wiki-news-subwords-300', 'conceptnet-numberbatch-17-06-300', 'word2vec-ruscorpora-300', 'word2vec-google-news-300', 'glove-wiki-gigaword-50', 'glove-wiki-gigaword-100', 'glove-wiki-gigaword-200', 'glove-wiki-gigaword-300', 'glove-twitter-25', 'glove-twitter-50', 'glove-twitter-100', 'glove-twitter-200', '__testing_word2vec-matrix-synopsis']


Here we load 'word2vec-google-news-300' (trained on Google News, with 300-dimensional vectors) and find the words most similar to ‘twitter’.

In [None]:
google_news = gensim.downloader.load('word2vec-google-news-300')
google_news.most_similar('twitter')



[('Twitter', 0.89089035987854),
 ('Twitter.com', 0.7536780834197998),
 ('tweet', 0.7431625723838806),
 ('tweeting', 0.7161933183670044),
 ('tweeted', 0.7137226462364197),
 ('facebook', 0.6988551616668701),
 ('tweets', 0.6974530816078186),
 ('Tweeted', 0.6950210928916931),
 ('Tweet', 0.6875007152557373),
 ('Tweeting', 0.6845167279243469)]

We can save our trained model, load it again later, and continue training it on new sentences.


In [None]:
w2v.save("word2vec.model")

In [None]:
model = Word2Vec.load("word2vec.model")
model.train([["hello", "world"]], total_examples=1, epochs=1)


(2, 2)