# Word Embedding

- Word vectorization is the process of mapping words to real numbers or vectors of real numbers. This lets machine learning techniques process the words and extract relevant information from them, which can then be used for tasks such as predicting words. Vectorization is done with reference to a corpus (collection) of the available words.

- It is a language modeling and feature learning technique. Word embedding is one way to perform this mapping, typically using a neural network.
- Various word embedding models are available, such as word2vec (Google), GloVe (Stanford) and fastText (Facebook).
- We are going to discuss word2vec in this tutorial.

## Where it is being used
- `Compute similar words:` Word embedding is used to suggest words similar to the one passed to the prediction model. It can also suggest dissimilar words, as well as the most common words.

- `Create a group of related words:` It is used for semantic grouping, which places things with similar characteristics close together and dissimilar things far apart.

- `Feature for text classification:` Text is mapped into arrays of vectors, which are fed to the model for training as well as prediction. Text-based classifier models cannot be trained on raw strings, so this step converts the text into a machine-trainable form. The semantic information captured by the vectors further helps text-based classification.

- `Document clustering:` This is another application where word embedding is widely used.

- `Natural language processing:` Word embedding is useful in many NLP tasks, such as part-of-speech tagging, sentiment analysis, and syntactic analysis, where it can outperform a separate feature extraction phase.

So far we have covered what word embedding is and where it is used, and named some of the models that implement it.
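
To make the idea of "similar words" concrete, here is a minimal sketch of ranking words by cosine similarity. The three-dimensional vectors are invented purely for illustration (real embeddings are learned from a corpus and have many more dimensions), and `cosine_similarity` and `most_similar` are hypothetical helper names, not part of any library.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy embeddings, made up for this example only.
embeddings = {
    "coffee": [0.9, 0.1, 0.3],
    "tea": [0.8, 0.2, 0.4],
    "laptop": [0.1, 0.9, 0.2],
}


def most_similar(word, vectors):
    """Rank every other word by cosine similarity to `word`, best first."""
    target = vectors[word]
    scores = [(other, cosine_similarity(target, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


ranking = most_similar("coffee", embeddings)
print(ranking)
```

With these made-up vectors, "tea" ranks above "laptop" for the query "coffee", which is the behaviour a trained embedding is expected to show for related words.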

### Count Vectorizer

Count vectorizer uses the following two models as its basis for vectorizing words according to their frequency.

#### Bag of Words Model
The BOW model is used in NLP to represent a given text/sentence/document as a collection (bag) of words, ignoring grammar and the order in which the words occur. It keeps track of the frequency of each word in the text, which can be used as a feature in many models.

Let’s understand this with an example:

Text1 = “I went to have a cup of coffee but I ended up having lunch with her.”

Text2 = “I don’t understand, what is the problem here?”

BOW1 = {I : 2, went : 1, to : 1, have : 1, a : 1, cup : 1, of : 1, coffee : 1, but : 1, ended : 1, up : 1, having : 1, lunch : 1, with : 1, her : 1}

BOW2 = {I : 1, don’t : 1, understand:1, what : 1 , is :1, the : 1, problem : 1, here : 1}

BOW is mainly used for feature selection. The dictionary above is converted into a list containing only the frequencies, and weights are assigned to terms on that basis, with the most frequent terms weighted highest. However, “stop words” are usually the most frequent words in a raw document, so a high frequency count does not mean that a word is important. “Tf-idf” was introduced to resolve this problem; we will discuss it later.
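
The dictionaries above can be built in a few lines of plain Python. This is only a sketch: the helper name `bag_of_words` is made up for this example, the tokenizer is a simple regular expression, and the text is lowercased first (so `I` is counted as `i`), which the hand-written dictionaries above do not do.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text, split it into word tokens, and count each token."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


text1 = "I went to have a cup of coffee but I ended up having lunch with her."
text2 = "I don't understand, what is the problem here?"

bow1 = bag_of_words(text1)
bow2 = bag_of_words(text2)
print(dict(bow1))
print(dict(bow2))

# Turn both bags into aligned count vectors over a shared, sorted vocabulary.
vocabulary = sorted(set(bow1) | set(bow2))
vector1 = [bow1[term] for term in vocabulary]
vector2 = [bow2[term] for term in vocabulary]
print(vocabulary)
print(vector1)
print(vector2)
```

The last step is the "dictionary to list" conversion described above: once every document is expressed over the same vocabulary, the count vectors can be used directly as features.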

#### n-gram model

As discussed above, the bag of words model does not keep the order of words in a given text; only their frequency matters. It ignores the context of the sentence and its grammatical structure, such as a verb following a proper noun. The n-gram model is used in such cases to preserve some of that context. An n-gram is a sequence of n consecutive words from a given text/document.

- When n = 1, it is called a “unigram”.
- When n = 2, it is called a “bigram”.
- When n = 3, it is called a “trigram”.

And so on.

Let’s understand this with an example:

Text1 = “I went to have a cup of coffee but I ended up having lunch with her.”

* Unigram 

[I, went, to, have, a, cup, of, coffee, but, I, ended, up, having, lunch, with, her]

* Bi-gram

[I went], [went to], [to have], [have a], [a cup], [cup of], [of coffee], [coffee but], [but I], [I ended], [ended up], [up having], [having lunch], [lunch with], [with her]

* Tri-gram

[I went to], [went to have], [to have a], [have a cup], [a cup of], [cup of coffee], [of coffee but], [coffee but I], [but I ended], [I ended up], [ended up having], [up having lunch], [having lunch with], [lunch with her]

Note: The BOW model is simply an n-gram model with n = 1, i.e. it counts unigrams.
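
The lists above can be generated with a short sliding-window function. This is a minimal sketch; `ngrams` is a name chosen for this example, and the text is split on whitespace only, with the final period left out.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens, in their original order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


text1 = "I went to have a cup of coffee but I ended up having lunch with her"
tokens = text1.split()

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(len(unigrams), len(bigrams), len(trigrams))  # 16 15 14
print(bigrams[:3])
print(trigrams[-1])
```

A text of 16 words yields 15 bigrams and 14 trigrams, because each window of size n can start at only `len(tokens) - n + 1` positions.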

#### Skip-grams

Skip-grams are a type of n-gram in which the words keep their original order but need not be adjacent in the given text, i.e. some words can be skipped.

Example:

Text2 = “I don’t understand, what is the problem here?”

1-skip 2-grams (2-grams formed while skipping exactly 1 word):

[I understand], [don’t what], [understand is], [what the], [is problem], [the here]
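
The 1-skip 2-grams above can be produced as follows. This sketch assumes the reading used in the example, where exactly one word is skipped; some definitions of k-skip-n-grams also include every smaller skip distance (so the ordinary bigrams too), which the second helper illustrates. Both function names are made up for this example.

```python
def skip_bigrams(tokens, k):
    """Pair each token with the one that follows after skipping exactly k tokens."""
    step = k + 1
    return [(tokens[i], tokens[i + step]) for i in range(len(tokens) - step)]


def skip_bigrams_up_to(tokens, k):
    """Alternative reading: allow 0, 1, ..., k skipped tokens between the pair."""
    pairs = []
    for skipped in range(k + 1):
        pairs.extend(skip_bigrams(tokens, skipped))
    return pairs


tokens = ["I", "don't", "understand", "what", "is", "the", "problem", "here"]

exact = skip_bigrams(tokens, 1)
print(exact)

inclusive = skip_bigrams_up_to(tokens, 1)
print(len(inclusive))
```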


Let's see how to implement a count vectorizer in Python:

##### Bag Of Words

In [17]:
# Example with a single document
# Without stop-word removal

from sklearn.feature_extraction.text import CountVectorizer 

from nltk.corpus import stopwords
import pandas as pd

# A list of documents (each list element is one document); here there is only one
string = ["This is an example of bag of words!"]

# Create the vectorizer; the text is tokenized and counted when it is fitted
vect1 = CountVectorizer()

vect1.fit_transform(string)
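# Note: get_feature_names() was deprecated in scikit-learn 1.0 and removed in 1.2;
# on newer versions, call get_feature_names_out() instead.
print("bag of words :",vect1.get_feature_names())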

bag of words : ['an', 'bag', 'example', 'is', 'of', 'this', 'words']


In [18]:
vect1.vocabulary_

{'this': 5, 'is': 3, 'an': 0, 'example': 2, 'of': 4, 'bag': 1, 'words': 6}

###### Fit, then transform new text to check which vocabulary words are present
- This is widely used for document or subject classification.

In [24]:

c_vect = CountVectorizer()

c_vect.fit(string)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 1), preprocessor=None, stop_words=None,
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)

In [26]:
string2 = ['Lets understand  bag']

c_new_vect = c_vect.transform(string2)

print ("Text Present at ",c_new_vect.toarray())

# Compare with the indexes of the vectorizer that was fitted above (c_vect)
print ("original indexes", c_vect.get_feature_names() )

Text Present at  [[0 1 0 0 0 0 0]]
original indexes ['an', 'bag', 'example', 'is', 'of', 'this', 'words']


In [20]:
# Bag of words with stop-word removal (CountVectorizer removes them for you, so no extra step is needed)
# The NLTK stop-word list must be downloaded once with nltk.download('stopwords')

stop_words = stopwords.words('english')

string = ["This is an example of bag of words!"]
vect1 = CountVectorizer(stop_words=stop_words)
print (vect1)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 1), preprocessor=None,
                stop_words=['i', 'me', 'my', 'myself', 'we', 'our', 'ours',
                            'ourselves', 'you', "you're", "you've", "you'll",
                            "you'd", 'your', 'yours', 'yourself', 'yourselves',
                            'he', 'him', 'his', 'himself', 'she', "she's",
                            'her', 'hers', 'herself', 'it', "it's", 'its',
                            'itself', ...],
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)


In [21]:
vect1.fit_transform(string)
print("bag of words :",vect1.get_feature_names())
print("vocab        :",vect1.vocabulary_)

bag of words : ['bag', 'example', 'words']
vocab        : {'example': 1, 'bag': 0, 'words': 2}


In [22]:
# Using function
def text_matrix(message, countvect):
    terms_doc = countvect.fit_transform(message)
    return pd.DataFrame(terms_doc.toarray(),columns=countvect.get_feature_names())

In [23]:
message = ['We are slowly making progress in Natural Language Processing',
          "We will get there", "But practice is the only mantra for success" ]

c_vect = CountVectorizer()
print ("Below matrix is the Bag of Words approach")
text_matrix(message, c_vect)

Below matrix is the Bag of Words approach


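| | are | but | for | get | in | is | language | making | mantra | natural | only | practice | processing | progress | slowly | success | the | there | we | will |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 0 |
| 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |
| 2 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |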


##### n-grams

In [27]:

from sklearn.feature_extraction.text import CountVectorizer 

from nltk.tokenize import word_tokenize

string = ["This is an example of n-gram!"]

vect1 = CountVectorizer(ngram_range=(1,1))

vect1.fit_transform(string)

vect2 = CountVectorizer(ngram_range=(2,2))
vect2.fit_transform(string)

vect3 = CountVectorizer(ngram_range=(3,3))
vect3.fit_transform(string)

vect4 = CountVectorizer(ngram_range=(4,4))
vect4.fit_transform(string)

print("1-gram  :",vect1.get_feature_names())

print("2-gram  :",vect2.get_feature_names())
print("3-gram  :",vect3.get_feature_names())
print("4-gram  :",vect4.get_feature_names())

1-gram  : ['an', 'example', 'gram', 'is', 'of', 'this']
2-gram  : ['an example', 'example of', 'is an', 'of gram', 'this is']
3-gram  : ['an example of', 'example of gram', 'is an example', 'this is an']
4-gram  : ['an example of gram', 'is an example of', 'this is an example']
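
Notice that `n` from "n-gram" is missing from the output. The default `token_pattern` shown in the `CountVectorizer` repr earlier, `(?u)\b\w\w+\b`, only keeps tokens of two or more word characters, and the vectorizer lowercases by default. The sketch below applies that same regular expression with Python's `re` module to imitate the tokenization step; it is an approximation for illustration, not scikit-learn's actual code path, and `tokenize` is a made-up helper name.

```python
import re

# Default token pattern, copied from the CountVectorizer repr shown earlier.
TOKEN_PATTERN = r"(?u)\b\w\w+\b"


def tokenize(text):
    """Lowercase the text and keep only tokens with two or more word characters."""
    return re.findall(TOKEN_PATTERN, text.lower())


tokens = tokenize("This is an example of n-gram!")
print(tokens)

# Build bigrams from the surviving tokens and sort them, as the vectorizer does.
bigrams = sorted(" ".join(pair) for pair in zip(tokens, tokens[1:]))
print(bigrams)
```

Single-character words such as "I" and "a" are dropped for the same reason, which is worth keeping in mind when comparing the vectorizer's output with n-grams written out by hand.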


## Tf-Idf (Term frequency–Inverse document frequency)

Wikipedia definition: “Tf-Idf, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in searches of information retrieval, text mining, and user modeling. The Tf–idf value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general. Tf–idf is one of the most popular term-weighting schemes today.”


### Term Frequency
It is simply how often a word appears in a document relative to the total number of words in the document. Mathematically, it is given as:

Term frequency = (Number of times a word appears in the document) / (Total number of words in the document)

### Inverse Document Frequency

Term frequency has the disadvantage that it gives higher weights to words that occur more often. Words like ‘a’, ‘the’, ‘in’, ‘of’, etc. appear in documents far more often than other words, so more informative words are wrongly given lower weights because they are less frequent.
IDF was introduced to tackle this problem. It decreases the weight of terms that appear in many documents and increases the weight of rare terms. Mathematically, it is given as:
 
Inverse Document Frequency = log [(Number of documents)/(Number of documents the word appears in)]   

**Note: the logarithm here is base 2.**


*Tf-Idf Score = Term frequency × Inverse Document Frequency*

In [28]:
import numpy as np 
np.log2(3)

1.584962500721156

Let's understand more with an example:

Doc 1: This is an example.

Doc 2: We will see how it works.

Doc 3: IDF can be confusing.



<img src= "tfidf.PNG">

In the above table, we have calculated the term frequency as well as inverse document frequency of each of the words present in the 3 documents given. 

Now, let's calculate the tf-idf score for each term. Since the words of one document are not present in any other document, their tf-idf value in those other documents is 0; e.g. the words of doc1 have a tf-idf of 0 for doc2 and doc3.

<img src= "tfidf2.PNG">

Hopefully this example has made clear how Tf-Idf works.
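
The manual calculation can also be reproduced in plain Python, using the term frequency and base-2 inverse document frequency formulas given earlier. This is a sketch: the function names are chosen for this example, the documents are lowercased and split on whitespace with the trailing periods left out, and the printed values are rounded to four decimals.

```python
import math

docs = ["This is an example", "We will see how it works", "IDF can be confusing"]
tokenized = [doc.lower().split() for doc in docs]


def term_frequency(term, tokens):
    """Occurrences of the term divided by the number of words in the document."""
    return tokens.count(term) / len(tokens)


def inverse_document_frequency(term, corpus):
    """log2(number of documents / number of documents containing the term).

    Assumes the term occurs in at least one document of the corpus.
    """
    containing = sum(1 for tokens in corpus if term in tokens)
    return math.log2(len(corpus) / containing)


def tf_idf(term, tokens, corpus):
    return term_frequency(term, tokens) * inverse_document_frequency(term, corpus)


for index, tokens in enumerate(tokenized, start=1):
    scores = {term: round(tf_idf(term, tokens, tokenized), 4) for term in tokens}
    print(f"Doc {index}: {scores}")
```

Every word appears in exactly one of the three documents, so each has an IDF of log2(3) ≈ 1.585, and the scores differ only through the document length in the term frequency.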

Let's see the Python implementation for it:

In [29]:
from sklearn.feature_extraction.text import TfidfVectorizer

import pandas as pd

tfid = TfidfVectorizer(smooth_idf=False)

doc= ["This is an example.","We will see how it works.","IDF can be confusing"]

doc_vector = tfid.fit_transform(doc)
#print(tfid.get_feature_names())

df= pd.DataFrame(doc_vector.todense(),columns=tfid.get_feature_names())
df
#print(doc_vector)

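| | an | be | can | confusing | example | how | idf | is | it | see | this | we | will | works |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.5 | 0.0 | 0.0 | 0.0 | 0.5 | 0.0 | 0.0 | 0.5 | 0.0 | 0.0 | 0.5 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.408248 | 0.0 | 0.0 | 0.408248 | 0.408248 | 0.0 | 0.408248 | 0.408248 | 0.408248 |
| 2 | 0.0 | 0.5 | 0.5 | 0.5 | 0.0 | 0.0 | 0.5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |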


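Although we used the same data set as in the manual calculation, the results differ from what we got by hand.

This is because scikit-learn modifies the textbook formula: it uses the raw term count as the term frequency, takes the natural logarithm instead of base 2, adds 1 to the idf so that terms occurring in every document are not ignored entirely, and (by default) L2-normalizes each document vector. With `smooth_idf=True`, it also adds 1 to the numerator and denominator of the idf to prevent division by zero.

You can learn more from the official scikit-learn documentation, quoted below: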

"
The formula that is used to compute the tf-idf for a term t of a document d in a document set is tf-idf(t, d) = tf(t, d) * idf(t), and the idf is computed as idf(t) = log [ n / df(t) ] + 1 (if `smooth_idf=False`), where n is the total number of documents in the document set and df(t) is the document frequency of t; the document frequency is the number of documents in the document set that contain the term t. The effect of adding "1" to the idf in the equation above is that terms with zero idf, i.e., terms that occur in all documents in a training set, will not be entirely ignored. (Note that the idf formula above differs from the standard textbook notation that defines the idf as idf(t) = log [ n / (df(t) + 1) ]).

If `smooth_idf=True` (the default), the constant "1" is added to the numerator and denominator of the idf as if an extra document was seen containing every term in the collection exactly once, which prevents zero divisions: idf(d, t) = log [ (1 + n) / (1 + df(d, t)) ] + 1."
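
To see where values such as 0.5 and 0.408248 come from, the quoted formula can be imitated in plain Python. This is a sketch under stated assumptions, not scikit-learn's implementation: it assumes the term frequency is the raw count, the logarithm is natural, and each row is L2-normalized (the `TfidfVectorizer` default). The function name `sklearn_style_tfidf` is made up for this example, and the documents are pre-tokenized by hand.

```python
import math
from collections import Counter


def sklearn_style_tfidf(tokenized_docs, smooth_idf=True):
    """Raw-count tf, natural-log idf plus one, then L2-normalize each row."""
    n_docs = len(tokenized_docs)
    vocabulary = sorted({term for doc in tokenized_docs for term in doc})
    doc_freq = {t: sum(1 for doc in tokenized_docs if t in doc) for t in vocabulary}

    def idf(term):
        if smooth_idf:
            return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1
        return math.log(n_docs / doc_freq[term]) + 1

    rows = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        weights = [counts[t] * idf(t) for t in vocabulary]
        norm = math.sqrt(sum(w * w for w in weights))
        rows.append([w / norm if norm else 0.0 for w in weights])
    return vocabulary, rows


docs = [["this", "is", "an", "example"],
        ["we", "will", "see", "how", "it", "works"],
        ["idf", "can", "be", "confusing"]]

vocabulary, rows = sklearn_style_tfidf(docs, smooth_idf=False)
print(vocabulary)
for row in rows:
    print([round(value, 6) for value in row])
```

Because every term occurs in exactly one document, all terms share the same idf, and after normalization a document with 4 distinct words gets 1/√4 = 0.5 per word while one with 6 distinct words gets 1/√6 ≈ 0.408248.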

In [30]:
# We will call the function created earlier
feb_message = ["What is that covid",
              "covid is nothing"]


tf = TfidfVectorizer()

#Passing same message with TF-IDF

text_matrix(feb_message,tf)

| | covid | is | nothing | that | what |
|---|---|---|---|---|---|
| 0 | 0.409937 | 0.409937 | 0.0 | 0.576152 | 0.576152 |
| 1 | 0.501549 | 0.501549 | 0.704909 | 0.0 | 0.0 |


In [31]:
# The weight of "covid" rises in the first document because it now occurs twice there
jul_message = ["What is that covid covid",
              "covid is bad"]

text_matrix(jul_message,tf)

| | bad | covid | is | that | what |
|---|---|---|---|---|---|
| 0 | 0.0 | 0.668501 | 0.334251 | 0.469778 | 0.469778 |
| 1 | 0.704909 | 0.501549 | 0.501549 | 0.0 | 0.0 |


#### CountVectorizer, TF-IDF, n-grams

In [32]:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

arr = ["Car was cleaned by Jack",
       "Jack was cleaned by Car."]


In [33]:
# If you want to take into account just term frequencies:
vectorizer = CountVectorizer(ngram_range=(2,2))

# The ngram range specifies your ngram configuration.

X = vectorizer.fit_transform(arr)

# Testing the ngram generation:
print("Feature Names \n",vectorizer.get_feature_names())


print('Array \n',X.toarray())



Feature Names 
 ['by car', 'by jack', 'car was', 'cleaned by', 'jack was', 'was cleaned']
Array 
 [[0 1 1 1 0 1]
 [1 0 0 1 1 1]]


In [34]:
# And now testing TFIDF vectorizer:
# You can still specify n-grams here.

vectorizer = TfidfVectorizer(ngram_range=(2, 2))
X = vectorizer.fit_transform(arr)


# Testing the TFIDF value + ngrams:
print(X.toarray())

[[0.         0.57615236 0.57615236 0.40993715 0.         0.40993715]
 [0.57615236 0.         0.         0.40993715 0.57615236 0.40993715]]


In [35]:
# Testing TFIDF vectorizer without normalization:
# You can still specify n-grams here.

vectorizer = TfidfVectorizer(ngram_range=(2, 2), norm=None)

X = vectorizer.fit_transform(arr)

# Testing TFIDF value before normalization:
print(X.toarray())

[[0.         1.40546511 1.40546511 1.         0.         1.        ]
 [1.40546511 0.         0.         1.         1.40546511 1.        ]]
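
The two outputs above differ only by the normalization step. As a quick check, the sketch below takes the first unnormalized row, divides it by its Euclidean (L2) length, and arrives at the normalized values printed earlier. The helper name `l2_normalize` is made up for this example; the input numbers are copied from the `norm=None` output.

```python
import math


def l2_normalize(row):
    """Scale a vector so that the sum of its squared entries equals 1."""
    norm = math.sqrt(sum(value * value for value in row))
    return [value / norm for value in row] if norm else list(row)


# First row of the unnormalized output above (norm=None).
raw_row = [0.0, 1.40546511, 1.40546511, 1.0, 0.0, 1.0]

normalized = l2_normalize(raw_row)
print([round(value, 8) for value in normalized])

# The raw weight 1.40546511 is the smoothed idf of a bigram found in one of two
# documents: ln((1 + 2) / (1 + 1)) + 1.
print(round(math.log(3 / 2) + 1, 8))
```

Normalization keeps the relative weights within a document but makes documents of different lengths comparable, which is why it is switched on by default.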
