# Word Embedding Methods
Word embedding methods can be broadly classified into three categories:

<b>Frequency based Embedding:</b>
* Count Vector
* TF-IDF Vector
* Co-Occurrence Vector
* Global Vectors (GloVe) (Stanford, 2014): factorizes the logarithm of the corpus's word co-occurrence matrix, which is similar to the count matrix you’ve used before.

<b>Prediction based Embedding:</b>
* Continuous bag-of-words (CBOW): the model learns to predict the center word given some context words.
* Continuous skip-gram / Skip-gram with negative sampling (SGNS): the model learns to predict the words surrounding a given input word.
* fastText (Facebook, 2016): builds on the skip-gram model and accounts for the internal structure of words by representing each word as a bag of character n-grams. This lets it produce vectors for out-of-vocabulary (OOV) words.

<b>Deep learning, contextual embeddings:</b>
* BERT (Google, 2018)
* ELMo (Allen Institute for AI, 2018)
* GPT-2 (OpenAI, 2019)



# Frequency Based Embedding

## Count Vector

A count vector summarizes how often each word occurs in each document.

In [2]:
import pandas as pd
import numpy as np

In [3]:
from sklearn.feature_extraction.text import CountVectorizer

document = ["Python is a great Language and this is Python Code",
            "Natural Lanugage Processing with Python is easy",
            "Count Vector is a Natural Lanugage Processing method"]
# Create a Vectorizer Object
vectorizer = CountVectorizer()
vectorizer.fit(document)
vector = vectorizer.transform(document)

In [4]:
vectorizer.vocabulary_.keys()

dict_keys(['python', 'is', 'great', 'language', 'and', 'this', 'code', 'natural', 'lanugage', 'processing', 'with', 'easy', 'count', 'vector', 'method'])

In [5]:
# get_feature_names() was removed in scikit-learn 1.2;
# get_feature_names_out() (available since 1.0) replaces it.
count_vector = pd.DataFrame(vector.toarray(),
                            columns=vectorizer.get_feature_names_out())


In [6]:
count_vector


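| | and | code | count | easy | great | is | language | lanugage | method | natural | processing | python | this | vector | with |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 0 | 0 | 1 | 2 | 1 | 0 | 0 | 0 | 0 | 2 | 1 | 0 | 0 |
| 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 1 |
| 2 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 0 |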


Each row of the matrix above corresponds to one document, and each column corresponds to a unique word (feature) in the corpus.

## TF-IDF

TF-IDF is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in information retrieval, text mining, and user modeling. The TF-IDF value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which adjusts for the fact that some words appear more frequently in general.

$$\text{TF-IDF} = \text{Term Frequency} \times \text{Inverse Document Frequency}$$

### Term Frequency

TF measures how frequently a term occurs in a document. Since documents differ in length, a term may appear many more times in a long document than in a short one. The term frequency is therefore often divided by the document length (i.e. the total number of terms in the document) as a form of normalization:

$$TF(t) = \frac{\text{Count}(\text{term } t,\ \text{document})}{\text{Count}(\text{all terms},\ \text{document})}$$

### Inverse Document Frequency

IDF measures how important a term is. When computing TF, all terms are considered equally important. However, certain terms such as "is", "of", and "that" may appear many times but carry little importance. We therefore need to weigh down the frequent terms while scaling up the rare ones, by computing the following:

$$IDF(t) = \log_e\left(\frac{\text{Total number of documents}}{\text{Number of documents containing term } t}\right)$$
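
The two formulas above can be checked by hand with a few lines of plain Python. This is a minimal sketch under stated assumptions: the helper names `tf`, `idf`, and `tf_idf` are hypothetical, and the whitespace tokenizer is a simplification chosen for illustration (it keeps one-letter words such as "a").

```python
import math

# The same three example sentences used throughout this notebook.
docs = [
    "Python is a great Language and this is Python Code",
    "Natural Lanugage Processing with Python is easy",
    "Count Vector is a Natural Lanugage Processing method",
]

# Assumption: lowercase and split on whitespace (a simplified tokenizer).
tokenized = [d.lower().split() for d in docs]


def tf(term, tokens):
    """Count of `term` divided by the total number of terms in the document."""
    return tokens.count(term) / len(tokens)


def idf(term, corpus):
    """Natural log of (number of documents / number of documents containing `term`).

    Assumes `term` occurs in at least one document of `corpus`.
    """
    n_containing = sum(1 for tokens in corpus if term in tokens)
    return math.log(len(corpus) / n_containing)


def tf_idf(term, tokens, corpus):
    """Product of the two quantities defined above."""
    return tf(term, tokens) * idf(term, corpus)


# Scores for three terms of the first document.
for term in ("python", "is", "code"):
    print(term, tf_idf(term, tokenized[0], tokenized))
```

With these definitions, "is" scores exactly 0 in every document because it appears in all three documents, so its IDF is log(1) = 0.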

In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer
tf_vectorizer = TfidfVectorizer()
tf_vector = tf_vectorizer.fit_transform(document)


In [8]:
# get_feature_names() was removed in scikit-learn 1.2;
# get_feature_names_out() (available since 1.0) replaces it.
tf_idf = pd.DataFrame(tf_vector.toarray(),
                      columns=tf_vectorizer.get_feature_names_out())


In [9]:
tf_idf


| | and | code | count | easy | great | is | language | lanugage | method | natural | processing | python | this | vector | with |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.338858 | 0.338858 | 0.0 | 0.0 | 0.338858 | 0.40027 | 0.338858 | 0.0 | 0.0 | 0.0 | 0.0 | 0.515421 | 0.338858 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.463121 | 0.0 | 0.273526 | 0.0 | 0.352215 | 0.0 | 0.352215 | 0.352215 | 0.352215 | 0.0 | 0.0 | 0.463121 |
| 2 | 0.0 | 0.0 | 0.443503 | 0.0 | 0.0 | 0.26194 | 0.0 | 0.337295 | 0.443503 | 0.337295 | 0.337295 | 0.0 | 0.0 | 0.443503 | 0.0 |


For example, the first document gives its highest TF-IDF weight (0.515421) to "python", which appears twice in it.
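
The values in the table do not follow the plain TF and IDF formulas given earlier, because `TfidfVectorizer` by default uses raw term counts, a smoothed IDF of ln((1 + n) / (1 + df)) + 1, and L2 normalization of each row. The sketch below reproduces the first row in plain Python under those defaults. The function name `sklearn_style_tfidf` is hypothetical, and the regular expression is assumed to mirror the vectorizer's default token pattern (tokens of two or more word characters).

```python
import math
import re
from collections import Counter

docs = [
    "Python is a great Language and this is Python Code",
    "Natural Lanugage Processing with Python is easy",
    "Count Vector is a Natural Lanugage Processing method",
]

# Assumption: lowercase, then keep tokens of 2+ word characters,
# which drops one-letter words such as "a".
tokenized = [re.findall(r"\b\w\w+\b", d.lower()) for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))


def sklearn_style_tfidf(tokens):
    """Raw count * smoothed IDF, followed by L2 normalization of the row."""
    counts = Counter(tokens)
    weights = {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in counts.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}


row0 = sklearn_style_tfidf(tokenized[0])
print(round(row0["python"], 6), round(row0["is"], 6), round(row0["code"], 6))
```

Because of the "+ 1" in the smoothed IDF, a word such as "is" that appears in every document still receives a non-zero weight.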

## Co-Occurrence Matrix with a fixed context window

Co-occurrence (also known as coincidence or concurrence) is the frequency with which two terms appear alongside each other, in a certain order, in a text corpus. Words with similar meanings tend to appear in similar contexts. For example, in "Apple is a fruit. Mango is a fruit.", both "apple" and "mango" are followed by "is a fruit".

In [18]:
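count_model = CountVectorizer()
X = count_model.fit_transform(document)
# X.T @ X multiplies term counts across documents, so here two words
# "co-occur" whenever they appear in the same document
# (the context is the whole document, not a sliding window).
Xc = X.T @ X
Xc.setdiag(0)  # ignore each word's co-occurrence with itself
terms = count_model.get_feature_names_out()
co_occurrence = pd.DataFrame(Xc.toarray(), index=terms, columns=terms)
co_occurrence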

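| | and | code | count | easy | great | is | language | lanugage | method | natural | processing | python | this | vector | with |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| and | 0 | 1 | 0 | 0 | 1 | 2 | 1 | 0 | 0 | 0 | 0 | 2 | 1 | 0 | 0 |
| code | 1 | 0 | 0 | 0 | 1 | 2 | 1 | 0 | 0 | 0 | 0 | 2 | 1 | 0 | 0 |
| count | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 0 |
| easy | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 1 |
| great | 1 | 1 | 0 | 0 | 0 | 2 | 1 | 0 | 0 | 0 | 0 | 2 | 1 | 0 | 0 |
| is | 2 | 2 | 1 | 1 | 2 | 0 | 2 | 2 | 1 | 2 | 2 | 5 | 2 | 1 | 1 |
| language | 1 | 1 | 0 | 0 | 1 | 2 | 0 | 0 | 0 | 0 | 0 | 2 | 1 | 0 | 0 |
| lanugage | 0 | 0 | 1 | 1 | 0 | 2 | 0 | 0 | 1 | 2 | 2 | 1 | 0 | 1 | 1 |
| method | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 0 |
| natural | 0 | 0 | 1 | 1 | 0 | 2 | 0 | 2 | 1 | 0 | 2 | 1 | 0 | 1 | 1 |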
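
The matrix above treats each whole document as the context. For comparison, a fixed context window can be sketched in plain Python as follows. This is an illustrative sketch, not part of the original notebook: the function name `cooccurrence_counts`, the whitespace tokenizer, and the choice of a symmetric window are assumptions.

```python
from collections import defaultdict


def cooccurrence_counts(sentences, window=2):
    """Count how often two words appear within `window` positions of each other.

    Returns a dict mapping (word, other_word) to a count; the counts are
    symmetric, so (a, b) and (b, a) always hold the same value.
    """
    counts = defaultdict(int)
    for sentence in sentences:
        tokens = sentence.lower().split()  # assumption: whitespace tokenizer
        for i, word in enumerate(tokens):
            # Look only to the right and record both directions, so every
            # pair of positions is visited exactly once.
            for j in range(i + 1, min(i + window + 1, len(tokens))):
                other = tokens[j]
                counts[(word, other)] += 1
                counts[(other, word)] += 1
    return counts


counts = cooccurrence_counts(["apple is a fruit", "mango is a fruit"], window=2)
print(counts[("apple", "is")], counts[("is", "fruit")], counts[("apple", "fruit")])
```

With a window of 2, "apple" and "fruit" never co-occur because they are three positions apart, whereas "is" and "fruit" co-occur once in each sentence.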


# Prediction Based Embedding

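Prediction-based methods learn embeddings by training a model, typically a shallow neural network, on a prediction task. The training labels come from the text itself, so no manual annotation is needed (often called self-supervised learning).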

## Continuous Bag of words (CBOW)

<b>The bag-of-words model</b> is a simplifying representation used in natural language processing and information retrieval (IR). In this model, a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity.

The CBOW model architecture tries to predict the current target word (the center word) from the source context words (the surrounding words). Take the simple sentence “the quick brown fox jumps over the lazy dog”. It can be turned into (context_window, target_word) pairs. With a context window of size 2 (one word on each side of the target), we get examples such as ([quick, fox], brown), ([the, brown], quick), ([the, dog], lazy), and so on. The model then learns to predict target_word from the words in context_window.
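
Generating these (context_window, target_word) pairs can be sketched as below. The function name `cbow_pairs` and the parameter `half_window` (the number of context words taken on each side of the target) are hypothetical, and skipping targets at the sentence edges is a simplifying assumption. Real implementations usually keep those targets with a shorter context.

```python
def cbow_pairs(tokens, half_window=1):
    """Build (context_words, target_word) training pairs for CBOW."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - half_window):i]
        right = tokens[i + 1:i + 1 + half_window]
        context = left + right
        # Assumption: keep only targets that have a full context on both sides.
        if len(context) == 2 * half_window:
            pairs.append((context, target))
    return pairs


sentence = "the quick brown fox jumps over the lazy dog".split()
pairs = cbow_pairs(sentence, half_window=1)
for context, target in pairs:
    print(context, "->", target)
```

The output includes the pairs `(['quick', 'fox'], 'brown')` and `(['the', 'dog'], 'lazy')` from the example in the text.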




## Continuous Skip-Gram 

The skip-gram model reverses the roles of the target and context words. Here the target word is fed in as the input, the hidden layer remains the same, and the output layer of the neural network is replicated once per context word to accommodate the chosen number of context words.
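
In practice, skip-gram training data is often expressed as individual (target_word, context_word) pairs, one per context position. A minimal sketch follows, assuming a symmetric window. The function name `skipgram_pairs` and the parameter `window` are hypothetical.

```python
def skipgram_pairs(tokens, window=1):
    """Build (target_word, context_word) training pairs for skip-gram."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs


sentence = "the quick brown fox jumps over the lazy dog".split()
sg_pairs = skipgram_pairs(sentence, window=1)
for target, context in sg_pairs[:4]:
    print(target, "->", context)
```

Compared with the CBOW example, the direction is flipped: "brown" is now the input, and "quick" and "fox" are the words to be predicted.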