# Word Embedding Methods
The different types of word embeddings can be broadly classified into three categories:

<b>Frequency based Embedding:</b>
* Count Vector
* TF-IDF Vector
* Co-Occurrence Vector
* Global Vectors (GloVe) (Stanford, 2014): factorizes the logarithm of the corpus's word co-occurrence matrix, which is built from counts much like the count matrix you have used before.
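
To make the co-occurrence idea concrete, here is a minimal sketch that counts how often two words appear within a fixed window of each other. The function name `cooccurrence_counts`, the toy sentences, and the window size are assumptions chosen for illustration; GloVe itself applies further weighting on top of counts like these.

```python
from collections import Counter

def cooccurrence_counts(sentences, window=1):
    """Count symmetric word pairs that occur within `window` positions of each other."""
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            # Look only to the right and record both directions to keep the counts symmetric.
            for neighbor in tokens[i + 1:i + 1 + window]:
                counts[(word, neighbor)] += 1
                counts[(neighbor, word)] += 1
    return counts

# Toy corpus (hypothetical, already tokenized).
sentences = [
    ["natural", "language", "processing"],
    ["language", "processing", "python"],
]
counts = cooccurrence_counts(sentences, window=1)
print(counts[("language", "processing")])  # pair seen in both sentences
```

Arranging these pair counts in a word-by-word table gives the co-occurrence matrix; each row can already serve as a crude, very sparse vector for its word.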

<b>Prediction based Embedding:</b>
* Continuous bag-of-words (CBOW): the model learns to predict the center word given some context words.
* Continuous skip-gram / Skip-gram with negative sampling (SGNS): the model learns to predict the words surrounding a given input word.
* fastText (Facebook, 2016): based on the skip-gram model; it takes the internal structure of words into account by representing each word as a bag of character n-grams, which lets it build vectors for out-of-vocabulary (OOV) words.
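
The difference between these models is easiest to see in the training examples they consume. The sketch below is illustrative only: the helper names `windows` and `char_ngrams`, the toy sentence, and the window size of 1 are assumptions, and no model is actually trained.

```python
def windows(tokens, window=1):
    """Yield (center, context) for every position, with `window` words on each side."""
    for i, center in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        yield center, context

def char_ngrams(word, n=3):
    """Character n-grams of a word wrapped in boundary markers, in the style of fastText."""
    marked = f"<{word}>"
    return [marked[i:i + n] for i in range(len(marked) - n + 1)]

tokens = "natural language processing with python".split()

# CBOW: the input is the context, the target is the center word.
cbow_examples = [(context, center) for center, context in windows(tokens)]

# Skip-gram: the input is the center word, the target is each surrounding word.
skipgram_pairs = [(center, word) for center, context in windows(tokens) for word in context]

print(cbow_examples[1])      # (['natural', 'processing'], 'language')
print(skipgram_pairs[:3])    # [('natural', 'language'), ('language', 'natural'), ('language', 'processing')]
print(char_ngrams("where"))  # ['<wh', 'whe', 'her', 'ere', 're>']
```

Because an unseen word still shares character n-grams with known words, a fastText-style model can assemble a vector for it from those pieces.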

<b>Deep learning, contextual embeddings:</b>
* BERT (Google, 2018)
* ELMo (Allen Institute for AI, 2018)
* GPT-2 (OpenAI, 2019)



# Count Vector

A count vector summarizes how often each word occurs in each document.

In [55]:
import pandas as pd
import numpy as np

In [56]:
from sklearn.feature_extraction.text import CountVectorizer

document = ["Python is a great Language and this is Python Code",
            "Natural Lanugage Processing with Python is easy",
            "Count Vector is a Natural Lanugage Processing method"]
# Create a Vectorizer Object
vectorizer = CountVectorizer()
vectorizer.fit(document)
vector = vectorizer.transform(document)

In [57]:
vectorizer.vocabulary_.keys()

dict_keys(['python', 'is', 'great', 'language', 'and', 'this', 'code', 'natural', 'lanugage', 'processing', 'with', 'easy', 'count', 'vector', 'method'])

In [58]:
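# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
# Build the frame directly from the dense counts; the default index is 0..len(document)-1.
count_vector = pd.DataFrame(vector.toarray(),
                            columns=vectorizer.get_feature_names_out())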


In [59]:
count_vector


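|   | and | code | count | easy | great | is | language | lanugage | method | natural | processing | python | this | vector | with |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 0 | 0 | 1 | 2 | 1 | 0 | 0 | 0 | 0 | 2 | 1 | 0 | 0 |
| 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 1 |
| 2 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 0 |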


Each row of the matrix above corresponds to one document, and each column corresponds to a unique word (feature) in the corpus.
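
The same matrix can be rebuilt by hand, which shows what the vectorizer is doing. This is a minimal sketch that assumes `CountVectorizer`'s default settings: text is lowercased and only tokens of two or more word characters are kept, which is why the word "a" has no column. The names `tokenize`, `vocabulary`, and `count_rows` are chosen here for illustration.

```python
import re
from collections import Counter

document = ["Python is a great Language and this is Python Code",
            "Natural Lanugage Processing with Python is easy",
            "Count Vector is a Natural Lanugage Processing method"]

# Assumed to mirror CountVectorizer's defaults: lowercase, tokens of 2+ word characters.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")

def tokenize(text):
    return TOKEN_PATTERN.findall(text.lower())

# One Counter per document, then a sorted vocabulary shared by all documents.
counts_per_doc = [Counter(tokenize(text)) for text in document]
vocabulary = sorted(set().union(*counts_per_doc))

# A Counter returns 0 for missing words, so every row has one entry per vocabulary term.
count_rows = [[counts[term] for term in vocabulary] for counts in counts_per_doc]

print(vocabulary)
for row in count_rows:
    print(row)
```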

# TF-IDF

TF-IDF (term frequency-inverse document frequency) is a numerical statistic that reflects how important a word is to a document within a collection or corpus. It is often used as a weighting factor in information retrieval, text mining, and user modeling. The TF-IDF value increases proportionally with the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which adjusts for the fact that some words appear more frequently in general.

$$ \text{TF-IDF} = \text{Term Frequency} \times \text{Inverse Document Frequency} $$
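
The exact definition of each factor varies between libraries. As a hedged reference, the variant used by scikit-learn's `TfidfVectorizer` with its default settings (smoothed IDF and L2 normalization, to the best of my understanding) can be written as follows, where $n$ is the number of documents and $\text{df}(t)$ is the number of documents containing term $t$:

$$ \text{tf}(t, d) = \text{number of times } t \text{ occurs in } d $$

$$ \text{idf}(t) = \ln\frac{1 + n}{1 + \text{df}(t)} + 1 $$

$$ \text{tf-idf}(t, d) = \frac{\text{tf}(t, d)\,\text{idf}(t)}{\sqrt{\sum_{t'} \left(\text{tf}(t', d)\,\text{idf}(t')\right)^2}} $$

For the three example documents ($n = 3$), "is" appears in every document, so $\text{idf} = \ln(4/4) + 1 = 1$, while "python" appears in two, so $\text{idf} = \ln(4/3) + 1 \approx 1.288$.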

In [60]:
from sklearn.feature_extraction.text import TfidfVectorizer
tf_vectorizer = TfidfVectorizer()
tf_vector = tf_vectorizer.fit_transform(document)


In [61]:
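# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
# toarray() returns a plain ndarray, which pandas handles more cleanly than todense()'s matrix.
tf_idf = pd.DataFrame(tf_vector.toarray(),
                      columns=tf_vectorizer.get_feature_names_out())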


In [62]:
tf_idf


|   | and | code | count | easy | great | is | language | lanugage | method | natural | processing | python | this | vector | with |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.338858 | 0.338858 | 0.0 | 0.0 | 0.338858 | 0.40027 | 0.338858 | 0.0 | 0.0 | 0.0 | 0.0 | 0.515421 | 0.338858 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.463121 | 0.0 | 0.273526 | 0.0 | 0.352215 | 0.0 | 0.352215 | 0.352215 | 0.352215 | 0.0 | 0.0 | 0.463121 |
| 2 | 0.0 | 0.0 | 0.443503 | 0.0 | 0.0 | 0.26194 | 0.0 | 0.337295 | 0.443503 | 0.337295 | 0.337295 | 0.0 | 0.0 | 0.443503 | 0.0 |
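
To see where these numbers come from, the weights can be recomputed without scikit-learn. This is a minimal sketch under stated assumptions: it uses the smoothed IDF `ln((1 + n) / (1 + df)) + 1` and L2 row normalization, which I believe are `TfidfVectorizer`'s defaults, plus the same tokenization rule as `CountVectorizer` (lowercase, tokens of two or more word characters). The helper names `tokenize` and `tfidf_rows` are illustrative.

```python
import math
import re
from collections import Counter

document = ["Python is a great Language and this is Python Code",
            "Natural Lanugage Processing with Python is easy",
            "Count Vector is a Natural Lanugage Processing method"]

def tokenize(text):
    # Assumed default tokenization: lowercase, tokens of 2+ word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

def tfidf_rows(docs):
    counts_per_doc = [Counter(tokenize(d)) for d in docs]
    vocabulary = sorted(set().union(*counts_per_doc))
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = {t: sum(1 for c in counts_per_doc if t in c) for t in vocabulary}
    # Smoothed inverse document frequency (assumed scikit-learn default).
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocabulary}
    rows = []
    for counts in counts_per_doc:
        raw = [counts[t] * idf[t] for t in vocabulary]
        norm = math.sqrt(sum(w * w for w in raw))  # L2 norm of the row
        rows.append([w / norm for w in raw])
    return vocabulary, rows

vocabulary, rows = tfidf_rows(document)
for row in rows:
    print([round(w, 6) for w in row])
```

In document 0, "python" and "is" both occur twice, yet "python" scores 0.515421 against 0.40027 for "is", because "is" appears in all three documents and therefore receives the lowest possible IDF.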
