<a href="https://colab.research.google.com/github/aayushkubb/nlp/blob/main/Text_Preprocessing_Vectors_Advance.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img src='https://drive.google.com/uc?id=11WfnSPn79Opv2rwTxldDYcv4Dv0pV-f3' />

Implementing Word Embeddings
--
This section assumes that you have a working knowledge of how a neural
network works and that you are familiar with terms such as:

a. Deep learning

b. Perceptron and Sigmoid

c. FFNN (Feed-Forward Neural Network)

d. RNN (Recurrent Neural Network)

**If you are new to neural networks (NN), it is suggested that you go through Chapter 1 to gain a basic understanding of how an NN works.**

The methods covered so far solve many problems, but they fall short once we move to more complicated problems where we need to capture the semantic relations between words.

Below are the challenges:

• All these techniques fail to capture the context and meaning of words. The methods discussed so far depend only on the appearance or frequency of words. We also need to capture context, or semantic relations: that is, which words frequently appear close to one another.

>a. I am eating an apple.

>b. I am using apple.

In the example above, "apple" takes on a different meaning depending on the adjacent words, "eating" and "using".

• For a problem like document classification (for example, classifying books in a library), each document is very large and generates a huge number of tokens.
In these scenarios, the number of features can grow out of control, hampering both accuracy and performance.
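The first challenge above can be made concrete with a small sketch. It counts which words occur within a fixed window around "apple" in the two example sentences. The function name `context_counts` and the lowercased token lists are assumptions made for illustration; they are not part of any library.

```python
from collections import Counter

def context_counts(sentences, target, window=2):
    """Count the words that appear within `window` positions of `target`."""
    counts = Counter()
    for tokens in sentences:
        for i, token in enumerate(tokens):
            if token != target:
                continue
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[tokens[j]] += 1
    return counts

# The two example sentences, tokenized by hand (illustrative data)
fruit_sense = [["i", "am", "eating", "an", "apple"]]
product_sense = [["i", "am", "using", "apple"]]

fruit_context = context_counts(fruit_sense, "apple")
product_context = context_counts(product_sense, "apple")
print(fruit_context)
print(product_context)
```

The two counters share no words, which is the signal a frequency-only representation of "apple" throws away.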

A machine/algorithm can match two documents/texts and say whether they are the same or not. But how do we make machines tell you about cricket or Virat Kohli when you search for MS Dhoni? How do you make a machine understand that “Apple” in “Apple is a tasty fruit” is a fruit that can be eaten and not a company?

The answer to the above questions lies in creating a representation for words that captures their meanings, semantic relationships, and the different types of contexts they are used in.

> The above challenges are addressed by Word Embeddings.

Word embedding is a feature learning technique in which words from the vocabulary are mapped to vectors of real numbers that capture their context.

In the table below, every word is represented by a vector of 4 numbers. Using the word embeddings technique, we are going to derive such a vector for each and every word so that we can use it in later analysis. In the example below, the dimension is 4, but we usually use a dimension greater than 100.

<img src="https://drive.google.com/uc?id=165llWGYsReLC4BCtyZs6ZLYeggkg1k1m"  />

Problem
--
You want to implement word embeddings.

Solution
--
Word embeddings are prediction based: a shallow neural network is trained on a prediction task, and the weights it learns are then used as the vector representation of each word.

<font color='green'>word2vec</font>
--
**word2vec** is a framework from Google for training word embeddings with a shallow neural network. It goes through every word of the corpus and learns to predict
the nearby words. In doing so, it creates a vector for each word in the
corpus in such a way that the context is captured. It also performs
strongly on word similarity and word analogy tasks.

There are two main types of word2vec model:

• Skip-Gram

• Continuous Bag of Words (CBOW)

<img src="https://drive.google.com/uc?id=1ZC7kOYkuY2BGRCONWde38usTOCRJqJlR"/>

The above figure shows the architecture of the CBOW and skip-gram
algorithms used to build word embeddings. Let us see how these models
work in detail.

Skip-Gram
--
The skip-gram model predicts the probabilities of the surrounding context words given a target word.

Let us take a small sentence and understand how it actually works.
Each sentence generates target words and their contexts, which are the words
nearby. The number of words to be considered on either side of the target word
is called the window size. The table below shows all the possible target
and context words for window size 2. The window size needs to be selected
based on the data and the resources at your disposal. The larger the window
size, the more computing power is required.

<img src="https://drive.google.com/uc?id=18nKDL_JAX96Zs_ILGMrcdd517GWLwrW2"/>
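The target/context pairs described above can be generated with a few lines of Python. This is a minimal sketch, assuming a fixed symmetric window; `skipgram_pairs` is a hypothetical helper name, and real implementations such as gensim may sample or shrink the window during training.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs for every word within `window` of each target."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

pairs = skipgram_pairs(["I", "love", "nlp"], window=2)
for target, context in pairs:
    print(target, "->", context)
```

Each pair becomes one training example: the network sees the target word as input and is asked to predict the context word.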

Since training on a real corpus takes a lot of text and computing power, let us go ahead and take some sample data and build a skip-gram model.

As mentioned *in earlier NB's*, import the text corpus and break it into sentences. Perform **some cleaning and preprocessing**, such as removing
punctuation and digits, and split the sentences into words or tokens.


In [11]:
# !pip uninstall gensim -y

In [12]:
#import library
!pip install gensim
import gensim
from gensim.models import Word2Vec
from sklearn.decomposition import PCA
from matplotlib import pyplot




In [13]:
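# sentence lengths in tokens
# (`sentences` is the tokenized example corpus defined under "training the model" below)
length=[]
for i in sentences:
    print(len(i))
    length.append(len(i))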

3
7
3
9
4


In [14]:
import numpy as np

In [15]:
np.mean(length)

5.2

In [16]:
np.median(length)

4.0

In [17]:
np.min(length)

3

In [18]:
np.max(length)

9

### training the model
https://radimrehurek.com/gensim/models/word2vec.html

In [19]:
#Example sentences
sentences = [['I', 'love', 'nlp'],
['I', 'will', 'learn', 'nlp', 'in', '2','months'],
['nlp', 'is', 'future'],
[ 'nlp', 'saves', 'time', 'and', 'solves',
'lot', 'of', 'industry', 'problems'],
['nlp', 'uses', 'machine', 'learning']]


In [20]:
# training the model
skipgram = Word2Vec(sentences, size = 50, window = 3, min_count=1,sg = 1)
# size=50 -> Size of the vector used to represent each token or word (default 100).
#            Note: this argument is named `vector_size` in gensim 4.x.
# window=3 -> The maximum distance between the target word and its neighboring word. (default 5)
# min_count=1 -> Minimum frequency count of words.
#                The model ignores words that occur fewer than min_count times.
#                Extremely infrequent words are usually unimportant. (default 5)
# workers -> How many threads to use behind the scenes? (default 3)
# sg -> The training algorithm: CBOW (0, the default) or skip-gram (1).

# access vector for one word (through the model's word vectors, `.wv`)
print(skipgram.wv['nlp'])

# Since our vector size parameter was 50, the model
# gives a vector of size 50 for each word.

[ 0.00507417  0.00798924  0.00214572  0.00105395 -0.00132298 -0.00317645
 -0.00226661 -0.00463028  0.00154063  0.00764827  0.00364192  0.00231408
  0.00700611 -0.0036976  -0.00864142  0.00363485 -0.00610256 -0.00287278
  0.00247946  0.00054521  0.00839198 -0.00729082 -0.00313484 -0.00671063
  0.00820201 -0.00541423  0.00295527  0.00052783  0.00714301 -0.00989141
 -0.00738953 -0.00322712 -0.0037096  -0.0012615  -0.00717171 -0.00835311
 -0.00822171  0.0058208  -0.00906421 -0.00531903 -0.00016989 -0.00074663
 -0.00909911  0.00099359  0.00258831  0.0093498  -0.00546927 -0.00868888
  0.00358278 -0.00153129]




In [21]:
# access vector for another word
print(skipgram.wv['deep'])

  


KeyError: ignored

**Note**: We get an error saying the word doesn’t exist because this word was not in our training data. This is why we need to train the algorithm on as much data as possible, so that we do not miss out on words.
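One way to avoid the `KeyError` is to check membership before looking a word up. The sketch below uses a hypothetical helper, `get_vector`, and a plain dictionary as a stand-in for trained vectors; it should work the same way with gensim's word vectors (for example `skipgram.wv`), which support both `in` and indexing.

```python
def get_vector(vectors, word, default=None):
    """Return vectors[word] if the word is known, otherwise `default`."""
    if word in vectors:
        return vectors[word]
    return default

# Stand-in for trained vectors (illustrative values only)
toy_vectors = {"nlp": [0.1, 0.2], "love": [0.3, 0.4]}

print(get_vector(toy_vectors, "nlp"))   # a known word returns its vector
print(get_vector(toy_vectors, "deep"))  # an unknown word returns the default
```

Returning a default keeps a pipeline running, but the unknown word still carries no information, so more training data remains the real fix.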


Continuous Bag of Words (CBOW)
--
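Now let’s see how to build a CBOW model. CBOW works the other way around from skip-gram: it predicts the target word from its surrounding context words. (Building it is very similar to the skip-gram model.)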
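To see how CBOW training examples differ from skip-gram ones, here is a minimal sketch that groups all context words of a target into a single example. The helper name `cbow_pairs` and the fixed window are assumptions for illustration, not gensim's internal code.

```python
def cbow_pairs(tokens, window=2):
    """Return (context_words, target) examples: the context predicts the target."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        if context:  # skip one-word sentences, which have no context
            pairs.append((context, target))
    return pairs

pairs = cbow_pairs(["nlp", "is", "future"], window=1)
for context, target in pairs:
    print(context, "->", target)
```

Skip-gram would emit one example per (target, context) pair, while CBOW emits one example per target with the whole context as input.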

In [22]:
#import library
from gensim.models import Word2Vec
from sklearn.decomposition import PCA
from matplotlib import pyplot

#Example sentences
sentences = [['I', 'love', 'nlp'],
['I', 'will', 'learn', 'nlp', 'in', '2','months'],
['nlp', 'is', 'future'],
[ 'nlp', 'saves', 'time', 'and', 'solves',
'lot', 'of', 'industry', 'problems'],
['nlp', 'uses', 'machine', 'learning']]

In [23]:
# training the model
cbow = Word2Vec(sentences, size =50, window = 3, min_count=1,sg = 0)
# size=50 -> Size of the vector used to represent each token or word.
#            Note: this argument is named `vector_size` in gensim 4.x.
# window=3 -> The maximum distance between the target word and its neighboring word.
# min_count=1 -> Minimum frequency count of words.
#                The model ignores words that occur fewer than min_count times.
#                Extremely infrequent words are usually unimportant.
# workers -> How many threads to use behind the scenes?
# sg=0 -> no skip-gram, so the default CBOW algorithm is used

# access vector for one word (through the model's word vectors, `.wv`)
print(cbow.wv['nlp'])

[ 0.00507417  0.00798924  0.00214572  0.00105395 -0.00132298 -0.00317645
 -0.00226661 -0.00463028  0.00154063  0.00764827  0.00364192  0.00231408
  0.00700611 -0.0036976  -0.00864142  0.00363485 -0.00610256 -0.00287278
  0.00247946  0.00054521  0.00839198 -0.00729082 -0.00313484 -0.00671063
  0.00820201 -0.00541423  0.00295527  0.00052783  0.00714301 -0.00989141
 -0.00738953 -0.00322712 -0.0037096  -0.0012615  -0.00717171 -0.00835311
 -0.00822171  0.0058208  -0.00906421 -0.00531903 -0.00016989 -0.00074663
 -0.00909911  0.00099359  0.00258831  0.0093498  -0.00546927 -0.00868888
  0.00358278 -0.00153129]




Important Observation 
--
Training these models on a realistic corpus requires a huge amount of computing
power. So, let us go ahead and use Google’s pre-trained model, which has
been trained on over 100 billion words.

Download the model from the below path and keep it in your local
storage:
https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit

or, **better, from this link**:

https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz

Note **if running on Jupyter NB**: the Google model is so large that loading it may raise a ValueError like this: ValueError: array is too big; `arr.size * arr.dtype.itemsize` is larger than the maximum possible size.
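A quick back-of-the-envelope calculation shows why the model is so heavy. The sketch assumes the published description of the GoogleNews model (about 3 million words and phrases, 300 dimensions) and 4-byte floats; `embedding_bytes` is a hypothetical helper.

```python
def embedding_bytes(vocab_size, vector_size, bytes_per_value=4):
    """Memory needed for a dense vocab_size x vector_size matrix of floats."""
    return vocab_size * vector_size * bytes_per_value

# Assumed figures for the GoogleNews model: ~3 million entries, 300 dimensions
full_model = embedding_bytes(3_000_000, 300)
# Loading only the first 500,000 entries would need far less
partial_model = embedding_bytes(500_000, 300)

print(f"full model:    {full_model / 1024**3:.2f} GiB")
print(f"500k entries:  {partial_model / 1024**3:.2f} GiB")
```

If memory is tight, gensim's `load_word2vec_format` accepts a `limit` argument that reads only the first N vectors from the file, which is one way to trade coverage for memory.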


In [24]:
import warnings
warnings.filterwarnings("ignore", category=UserWarning)

# import gensim package
import gensim

# load the saved model
#model = gensim.models.KeyedVectors.load_word2vec_format('datasets/GoogleNews-vectors-negative300.bin', binary=True)
model = gensim.models.KeyedVectors.load_word2vec_format('https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz', binary=True)

In [25]:
# lets check similarity
print (model.similarity('This', 'is'))

#Lets check one more.
print (model.similarity('post', 'book'))

#print(model.similarity('seed', 'need'))

0.3030219
0.057204384


“`This`” and “`is`” have a fair amount of similarity, but the similarity
between the words “`post`” and “`book`” is poor. For any given pair of words, the model takes the vectors of both words and calculates the cosine similarity between them.
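The similarity score above is a cosine similarity between two word vectors. A minimal pure-Python sketch of that formula follows; the input vectors are made-up values for illustration, not vectors from the Google model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

same_direction = cosine_similarity([1.0, 0.0], [2.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
in_between = cosine_similarity([3.0, 4.0], [4.0, 3.0])  # 24 / 25

print(same_direction, orthogonal, in_between)
```

The score depends only on direction, not length, so it ranges from -1 (opposite) through 0 (unrelated) to 1 (same direction).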

In [26]:
# Finding the odd one out.
model.doesnt_match('breakfast cereal dinner lunch'.split())



'cereal'

Of `breakfast`, `cereal`, `dinner` and `lunch`, **cereal** is the odd one out: it is
the word least related to the remaining 3 words.
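The odd-one-out idea can be sketched as follows: normalize each vector, average them, and pick the word whose vector is least similar to that average. This is an approximation of the approach under stated assumptions; the two-dimensional vectors and the helper names `unit` and `odd_one_out` are invented for illustration, and gensim's own implementation may differ in detail.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def odd_one_out(vectors):
    """Return the word whose unit vector has the lowest dot product with the mean direction."""
    normed = {word: unit(vec) for word, vec in vectors.items()}
    dim = len(next(iter(normed.values())))
    mean = unit([sum(vec[k] for vec in normed.values()) / len(normed) for k in range(dim)])
    scores = {word: sum(a * b for a, b in zip(vec, mean)) for word, vec in normed.items()}
    return min(scores, key=scores.get)

# Made-up 2-d vectors: the three meals point one way, "cereal" another
toy = {
    "breakfast": [0.9, 0.1],
    "lunch": [0.8, 0.2],
    "dinner": [0.85, 0.15],
    "cereal": [0.1, 0.9],
}
result = odd_one_out(toy)
print(result)
```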

In [27]:
# It is also finding the relations between words.
#model.most_similar(positive=['woman', 'king'] , negative=['man'])  # default value of topn is 10

# try this too :
model.most_similar(positive=['woman', 'king'], negative=['man'], topn=3)

[('queen', 0.7118192911148071),
 ('monarch', 0.6189674139022827),
 ('princess', 0.5902431011199951)]

<img src="https://drive.google.com/uc?id=11Yu1Gj4Rw5BccL6KXnT_rXqYPyJbEUfZ"/>
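The analogy query works by vector arithmetic: add the `positive` vectors, subtract the `negative` ones, and look for the nearest remaining word. The sketch below is a simplified version under stated assumptions: the two-dimensional vectors (one axis for royalty, one for gender) are invented, and gensim additionally normalizes the vectors before combining them.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def analogy(vectors, positive, negative, topn=1):
    """Rank words by similarity to sum(positive) - sum(negative), skipping the query words."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, vectors[word])]
    for word in negative:
        query = [q - x for q, x in zip(query, vectors[word])]
    exclude = set(positive) | set(negative)
    scored = [(w, cosine(query, v)) for w, v in vectors.items() if w not in exclude]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]

# Invented vectors: [royalty, gender] with male = +1 and female = -1
toy = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.0],
}
best = analogy(toy, positive=["woman", "king"], negative=["man"])
print(best[0][0])
```

Here `king - man + woman` lands exactly on the invented `queen` vector; with real embeddings the match is only approximate, as the scores in the output above show.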

# Implementing <font color='green'>fastText</font>
--
**fastText** is another framework, developed by Facebook, for learning word representations that capture context and meaning.

Problem
--
You want to implement fastText in Python.

Solution
--
fastText is an improved version of word2vec. word2vec treats each whole
word as the basic unit when building the representation, whereas fastText also
uses the character n-grams inside each word when computing its representation.
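To make "character n-grams" concrete, the sketch below lists the substrings of a word after wrapping it in the boundary markers `<` and `>`, which fastText uses to mark the start and end of a word. The helper name `char_ngrams` is hypothetical; the defaults of 3 and 6 mirror the usual `min_n` and `max_n` defaults.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """List the character n-grams of `word`, including boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams

grams = char_ngrams("nlp", min_n=3, max_n=4)
print(grams)
```

Because a word's vector is built from the vectors of its n-grams, fastText can produce a vector even for a word it never saw in training, as long as some of its n-grams were seen.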

In [28]:
# Let us see how to build a fastText word embedding.
# Import FastText
from gensim.models import FastText
from sklearn.decomposition import PCA
from matplotlib import pyplot

#Example sentences
sentences = [['I', 'love', 'nlp'],
['I', 'will', 'learn', 'nlp', 'in', '2','months'],
['nlp', 'is', 'future'],
[ 'nlp', 'saves', 'time', 'and', 'solves',
'lot', 'of', 'industry', 'problems'],
['nlp', 'uses', 'machine', 'learning']]

fast = FastText(sentences,size=10, window=1, min_count=1, workers=5, min_n=1, max_n=2)
# size=10 -> Size of the vector used to represent each token or word.
#            Note: this argument is named `vector_size` in gensim 4.x.
# window=1 -> The maximum distance between the target word and its neighboring word.
# min_count=1 -> Minimum frequency count of words.
#                The model ignores words that occur fewer than min_count times.
#                Extremely infrequent words are usually unimportant.
# workers -> How many threads to use behind the scenes?
# min_n=1, max_n=2 -> Minimum and maximum length of the character n-grams
#                     used to build each word's vector.

# Analogies such as "Father" - "Boy" + "Girl" == "Mother" can be queried with most_similar:
#print(fast.wv.most_similar(['girl', 'father'], ['boy'], topn=3))
# [('mother', 0.7996115684509277), ('grandfather', 0.7629683613777161),
# ('wife', 0.7478234767913818)]
# (results like these need a model trained on a much larger corpus than this toy one)


# vector for word nlp
print(fast.wv['nlp'])


[ 0.0012215   0.02347985 -0.01262391 -0.02205611 -0.00839358 -0.00156716
 -0.00200545  0.00993094 -0.00620706 -0.0087144 ]




In [29]:
# Try this 
print(fast.wv.most_similar(['machine', 'learning'], ['nlp'], topn=3))

[('saves', 0.8040894269943237), ('months', 0.6550357341766357), ('industry', 0.6398235559463501)]


  


<h3><font color='green'><b>I am sure !! </b></font> </h3>

By now you are familiar and comfortable with processing natural language. Now that the data is cleaned and the features are created, let’s jump into building some applications around it that solve business problems, in the <b>upcoming NB's</b>.

<font color='green'>Before moving ahead, <b>I would highly recommend</b> watching this YouTube <u><b>video</b></u>:</font> <br> https://www.youtube.com/watch?v=LSS_bos_TPI

<b>This would further clarify the concept of word embeddings.</b>

<hr>

**Just in case**

You may also want to explore **Stanford’s GloVe embeddings**, which are very similar to the libraries above:

https://machinelearningmastery.com/develop-word-embeddings-python-gensim/

<hr>
<br><br>
<u><b>Further Resources</b></u> :

https://machinelearningmastery.com/prepare-text-data-machine-learning-scikit-learn/

https://datascience.stackexchange.com/questions/22250/what-is-the-difference-between-a-hashing-vectorizer

In [30]:
documents = [
    "Human machine interface, for lab abc& computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees?",
    "Graph minors IV Widths of trees and well. quasi ordering",
    "Graph minors A survey",
]



### Activity

In [31]:
# cleaning the texts

# remove common words and tokenize

# remove words that appear only once

### Activity- Solution

In [32]:
from pprint import pprint  # pretty-printer
from collections import defaultdict
import re

#Get the character set
characters=set()
for sent in documents:
    for word in sent.split():
        for char in word:
            characters.add(char.lower())
            
# cleaning the texts

documents_clean=[]

for sent in documents:
    # strip the punctuation characters found in the corpus: & , ? .
    sent=re.sub(r"[&,?.]","",sent)
    documents_clean.append(sent)

# remove common words and tokenize
stoplist = set('for a of the and to in'.split())
texts = [
    [word for word in document.lower().split() if word not in stoplist]
    for document in documents_clean
]

# remove words that appear only once
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1

texts = [
    [token for token in text if frequency[token] > 1]
    for text in texts
]

pprint(texts)


[['human', 'interface', 'computer'],
 ['survey', 'user', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'system'],
 ['system', 'human', 'system', 'eps'],
 ['user', 'response', 'time'],
 ['trees'],
 ['graph', 'trees'],
 ['graph', 'minors', 'trees'],
 ['graph', 'minors', 'survey']]


In [35]:
skipgram = Word2Vec(texts, size = 50, window = 6, min_count=1,sg = 1)

In [37]:
vector = skipgram.wv['computer']  # get numpy vector of a word
sims = skipgram.wv.most_similar('computer', topn=10)  # get other similar words
print(sims)
#Try with more number of epochs

[('human', 0.21674661338329315), ('interface', 0.11780162155628204), ('survey', 0.11475444585084915), ('user', 0.10344666242599487), ('trees', 0.09261929988861084), ('graph', 0.03716857731342316), ('time', 0.028900787234306335), ('eps', 0.02266264148056507), ('system', -0.051353201270103455), ('minors', -0.0720280259847641)]



# GloVe Embeddings

In [40]:
import gensim.downloader

# Show all available models in gensim-data

pprint(list(gensim.downloader.info()['models'].keys()))

['fasttext-wiki-news-subwords-300',
 'conceptnet-numberbatch-17-06-300',
 'word2vec-ruscorpora-300',
 'word2vec-google-news-300',
 'glove-wiki-gigaword-50',
 'glove-wiki-gigaword-100',
 'glove-wiki-gigaword-200',
 'glove-wiki-gigaword-300',
 'glove-twitter-25',
 'glove-twitter-50',
 'glove-twitter-100',
 'glove-twitter-200',
 '__testing_word2vec-matrix-synopsis']


In [41]:
# Download the "glove-twitter-25" embeddings

glove_vectors = gensim.downloader.load('glove-twitter-25')

# Use the downloaded vectors as usual:

glove_vectors.most_similar('twitter')




[('facebook', 0.9480051398277283),
 ('tweet', 0.9403422474861145),
 ('fb', 0.9342358708381653),
 ('instagram', 0.9104823470115662),
 ('chat', 0.8964964747428894),
 ('hashtag', 0.8885936141014099),
 ('tweets', 0.8878157734870911),
 ('tl', 0.8778461813926697),
 ('link', 0.877821147441864),
 ('internet', 0.8753897547721863)]

In [42]:
glove_vectors['twitter']

array([ 1.6952  ,  0.42694 ,  0.14433 , -0.16535 ,  1.0463  , -0.029846,
        0.33623 ,  1.5362  , -0.58481 ,  0.50349 , -0.50595 , -0.91136 ,
       -3.8011  , -0.8685  , -0.13552 ,  0.97055 , -0.13545 , -0.29825 ,
       -1.2837  , -0.63245 ,  0.44748 , -0.92231 , -0.4138  ,  0.20287 ,
       -0.33432 ], dtype=float32)

