The following is based on:
- https://openclassrooms.com/en/courses/6532301-introduction-to-natural-language-processing/6980811-apply-a-simple-bag-of-words-approach
- https://openclassrooms.com/en/courses/6532301-introduction-to-natural-language-processing/7067116-apply-the-tf-idf-vectorization-approach
- https://openclassrooms.com/en/courses/6532301-introduction-to-natural-language-processing/6980911-discover-the-power-of-word-embeddings
- https://www.shanelynn.ie/word-embeddings-in-python-with-spacy-and-gensim/
- https://ashutoshtripathi.com/2020/09/04/word2vec-and-semantic-similarity-using-spacy-nlp-spacy-series-part-7/
- https://israelg99.github.io/2017-03-23-Word2Vec-Explained/

# 4. Text vectorization

Vectorization is the general process of turning a collection of text documents (a corpus) into numerical feature vectors fed to machine learning algorithms for modeling. When you vectorize the corpus, you convert each word or token from the documents into an array of numbers. This array is the vector representation of the word.

## 4.1. Bag of words

Bag-of-words (BOW) is a simple but powerful approach to vectorizing text.

As the name may suggest, the bag-of-words technique does not consider the position of a word in a document. The idea is to count the number of times each word appears in each of the documents. It is a simple method, but it works.

Consider the three following documents and count the number of times each word appears in each sentence.

<img src="https://user.oc-static.com/upload/2020/10/23/16034397439042_surfin%20bird%20bow.png" alt="drawing" width="600"/>

The matrix calculated on this simple example of three sentences can be generalized to many documents in the corpus. Each document is a row, and each token is a column. Such a matrix is called the document-term matrix.

Note that the size of the document-term matrix is:
number of documents ∗ size of vocabulary
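
As a minimal sketch of the counting idea, the snippet below builds a document-term matrix by hand with the Python standard library only. The three short documents and the helper name `build_document_term_matrix` are illustrative assumptions, not the example from the figure, and the whitespace tokenization is deliberately naive.

```python
from collections import Counter

def build_document_term_matrix(documents):
    """Return (vocabulary, matrix); matrix[i][j] counts vocabulary[j] in documents[i]."""
    # Naive tokenization: lowercase, then split on whitespace.
    tokenized = [doc.lower().split() for doc in documents]
    # The vocabulary is the sorted set of every token seen in the corpus.
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)  # missing tokens count as 0
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix

# Hypothetical toy corpus
documents = [
    "the bird is the word",
    "the word is out",
    "a bird sings",
]

vocabulary, matrix = build_document_term_matrix(documents)
print(vocabulary)
for row in matrix:
    print(row)
# Shape of the matrix: number of documents x size of vocabulary
print(len(matrix), "x", len(vocabulary))
```

Word order plays no role here: shuffling the words inside a document leaves its row unchanged.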


You can use the CountVectorizer from scikit-learn (you can read more on the official documentation page https://scikit-learn.org/stable/modules/feature_extraction.html) to generate the document-term matrix from a corpus with the following code:

In [1]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
corpus = [
    '2 cups of flour',
    'replace the flour',
    'replace the keyboard in 2 minutes',
    'do you prefer Windows or Mac',
    'the Mac has the most noisy keyboard',
]
X = vectorizer.fit_transform(corpus)
X.todense()

matrix([[1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
        [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0],
        [0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0],
        [0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1],
        [0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 2, 0, 0]])

Each row corresponds to one of the sentences, and each column to a word in the corpus. For instance, the word "the" appears once in documents two and three and twice in document five, while the word "flour" appears once in documents one and two. The vocabulary is strongly related to the sentence topic: the word "flour" only appears in documents about recipes. On the other hand, "the" is less specific.

Reducing the size of the vocabulary is important to avoid performing calculations over gigantic matrices. Removing stop words and lemmatizing reduce the vocabulary significantly, but this is often not enough.
The idea is to remove as many tokens as possible without throwing away relevant information. It's a delicate balance that depends entirely on the context. One strategy is to filter out words that are either too frequent or too rare. Another is to apply a dimensionality reduction technique, such as PCA, to the document-term matrix.

## 4.2. Tf-idf

The problem with counting word occurrences is that some words appear only in a limited number of documents. The model will learn that pattern, overfit the training data, and fail to generalize to new texts properly. Similarly, words that are present in all the documents will not bring any information to the model.

For this reason, it is sometimes better to normalize the word counts by the number of documents they appear in. This is the general idea behind tf-idf vectorization.

Let's look more closely at what tf-idf stands for:

- Tf stands for term frequency, the number of times the word appears in each document. 
- Idf stands for inverse document frequency, an inverse count of the number of documents a word appears in. Idf measures how significant a word is in the whole corpus.

If you multiply tf with idf, you get the tf-idf score: 

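$$\mathrm{tf}(t,d) = \text{number of times term } t \text{ appears in document } d$$

$$\mathrm{idf}(t,D) = \frac{\text{number of documents}}{\text{number of documents that contain term } t}$$

$$\mathrm{tfidf}(t,d,D) = \mathrm{tf}(t,d) \cdot \mathrm{idf}(t,D)$$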

- t is the word or token.
- d is the document.
- D is the set of documents in the corpus.
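
The three definitions above translate almost literally into Python. This is a minimal sketch that follows the formulas exactly as written (no logarithm, no smoothing). The function names and the whitespace tokenization are assumptions made for the illustration.

```python
def tf(term, document):
    """Number of times `term` appears in the tokenized document."""
    return document.count(term)

def idf(term, corpus):
    """Number of documents divided by the number of documents containing `term`.

    Raises ZeroDivisionError if the term appears in no document.
    """
    containing = sum(1 for document in corpus if term in document)
    return len(corpus) / containing

def tfidf(term, document, corpus):
    return tf(term, document) * idf(term, corpus)

sentences = [
    '2 cups of flour',
    'replace the flour',
    'replace the keyboard in 2 minutes',
    'do you prefer Windows or Mac',
    'the Mac has the most noisy keyboard',
]
# Each document becomes a list of lowercase tokens.
corpus = [sentence.lower().split() for sentence in sentences]

for term, doc_index in [("the", 4), ("noisy", 4), ("flour", 0)]:
    score = tfidf(term, corpus[doc_index], corpus)
    print(f"{term!r} in document {doc_index + 1}: {score:.2f}")
```

In the last document, "noisy" appears once and "the" twice, yet "noisy" gets the higher score because it occurs in a single document of the corpus.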

A tf-idf score is a decimal number that measures the importance of a word in a given document. It gives small values to words that are frequent across all the documents and more weight to words that are scarce across the corpus. 

In scikit-learn, tf-idf is implemented as the TfidfVectorizer (you can read more in the scikit-learn documentation https://scikit-learn.org/stable/modules/feature_extraction.html).

There are multiple ways to calculate the tf or the idf part of tf-idf, depending on whether you want to maximize the impact of rare words or lower the role of frequent ones. For instance, when the corpus is composed of documents of varying length, you can normalize by the size of each document:

$$\mathrm{tf}(t,d) = \frac{n_{t,d}}{\text{vocab size of the document } d}$$

Or you can take the log:

$$\mathrm{tf}(t,d) = \log(n_{t,d} + 1)$$

Here, $n_{t,d}$ is the number of times the term $t$ appears in the document $d$.
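
The two variants can be sketched as follows. The function names are made up for the illustration, and "vocab size of the document" is read here as the number of distinct tokens in the document, which is an assumption about the intended meaning.

```python
import math

def tf_length_normalized(term, document):
    """Raw count divided by the number of distinct tokens in the document."""
    # Assumption: "vocab size of the document" = number of distinct tokens.
    return document.count(term) / len(set(document))

def tf_log(term, document):
    """Log-scaled count: dampens the effect of very frequent terms."""
    return math.log(document.count(term) + 1)

document = "the mac has the most noisy keyboard".split()

for term in ["the", "noisy", "flour"]:
    print(term, tf_length_normalized(term, document), tf_log(term, document))
```

A term that is absent from the document gets 0 with both variants, since the count is 0 and log(0 + 1) = 0.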



Choosing the CountVectorizer over the TfidfVectorizer depends on the nature of the documents you are working on. Tf-idf does not always bring better results than merely counting the occurrences of the words in each document!

Here are a couple of cases where tf could perform better than tf-idf:

- If words are distributed equally across the documents, then normalizing by idf will not matter much. As such, taking into account each word's specificity across the corpus does not improve the model's performance.
- If rare words do not carry valuable meaning to the classification model, then tf-idf does not have a particular advantage. For example, a slang word in a social media comment may be rare across the corpus and yet mean something quite general.

By concatenating a word's scores across the documents of the corpus, you get a vector for that word. The dimension of the word vector equals the number of documents in the corpus. For example, if the corpus holds four documents, the vector's dimension is 4. For a corpus of 1000 documents, the vector dimension is 1000. Words that are not in the corpus do not get a vector representation, meaning that the vocabulary size and elements are also entirely dependent on the corpus at play. In short, tf-idf vectorization gives a numerical representation of words entirely dependent on the nature and number of documents being considered. The same words will have different vector representations in another corpus.
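
In matrix terms, a word vector is simply a column of the document-term matrix. The sketch below reads one such column from a `TfidfVectorizer` fitted on a five-document corpus, so the vector has five dimensions. The variable names are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    '2 cups of flour',
    'replace the flour',
    'replace the keyboard in 2 minutes',
    'do you prefer Windows or Mac',
    'the Mac has the most noisy keyboard',
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)  # shape: (documents, vocabulary)

# Transposing gives one row per word: shape (vocabulary, documents)
word_vectors = X.T.toarray()

flour_vector = word_vectors[vectorizer.vocabulary_["flour"]]
print(flour_vector.shape)
print(flour_vector)
```

The vector for "flour" is non-zero only in the first two positions, because the word appears only in the first two documents. Adding a document to the corpus would add a dimension to every word vector.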

We will explore numerical representations of words called embeddings (Word2vec, GloVe, fastText) in the next part. Pre-trained embeddings are absolute: they do not depend on the corpus you are working on, which is an important distinction!

## 4.3. Word embeddings

We have seen that the bag-of-words and tf-idf approaches are simple and quite efficient methods, but have several shortcomings:
- Context and meaning are lost.
- The document-term matrix is large and sparse.
- Vectorization is relative to the corpus (the same words will have different vectors in another corpus).
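
To make the sparsity point concrete, the sketch below counts the zeros in the small document-term matrix produced by the CountVectorizer example in section 4.1. The matrix is copied by hand from that output.

```python
# Document-term matrix from the CountVectorizer example (5 documents, 17 words)
document_term_matrix = [
    [1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0],
    [0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1],
    [0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 2, 0, 0],
]

total = sum(len(row) for row in document_term_matrix)
zeros = sum(value == 0 for row in document_term_matrix for value in row)

print(f"{zeros} of {total} entries are zero ({zeros / total:.0%})")
```

Even on five short sentences, most entries are zero. On a realistic corpus with tens of thousands of words, the proportion of zeros is typically much higher.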

In 2013, a new text vectorizing method called embeddings took NLP by storm. An embedding technique called Word2vec was born, soon to be followed by GloVe and fastText. These new text vectorization techniques solved the inherent shortcomings of bag of words and tf-idf approaches. They also somehow managed to retain semantic similarity between words, meaning that these vectors can recognize the meaning of a word and determine its similarity to others. Let's explore further:

### 4.3.1. They Retain Semantic Similarity

As mentioned, one of the most remarkable properties of embeddings is their ability to capture the semantic relationship between words. For example, a hammer and pliers are both tools. Since they are related or similar in meaning, their vectors will be near one another. The same holds for the words apple and pear, or truck and vehicle.

When visualizing word vectors in a 2D space, similar words are grouped in the same regions. The figure below shows the five words most similar to each of the following: Paris, London, Moscow, Twitter, Facebook, pizza, fish, train, and car, according to Word2vec embeddings.

<img src="https://user.oc-static.com/upload/2021/01/11/16103734115528_P3C1-1%20%281%29.png" alt="drawing" width="600"/>

As you can see, similar words keep their semantic distance! Truly amazing!

With embeddings, it also becomes possible to capture analogies between words. For example, a woman is to a queen what a man is to a king; Paris is to France what Berlin is to Germany. You can also add and subtract words.

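$$\overrightarrow{\text{queen}} - \overrightarrow{\text{woman}} = \overrightarrow{\text{king}} - \overrightarrow{\text{man}}$$

$$\overrightarrow{\text{France}} - \overrightarrow{\text{Paris}} = \overrightarrow{\text{Germany}} - \overrightarrow{\text{Berlin}}$$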

In this case, the distance between the respective vectors for woman and queen is close to the distance between the vectors for man and king.
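
The arithmetic on word vectors can be illustrated with a toy example. The 2-D vectors below are invented for the purpose (one axis loosely standing for "royalty", the other for "gender"). Real embeddings have far more dimensions and are learned from data, and the analogy only holds approximately there.

```python
import math

# Hypothetical 2-D embeddings, made up for this illustration
vectors = {
    "king":  [0.9, 0.8],
    "queen": [0.9, 0.2],
    "man":   [0.1, 0.8],
    "woman": [0.1, 0.2],
    "apple": [0.5, 0.5],
}

def add(u, v):
    return [a + b for a, b in zip(u, v)]

def subtract(u, v):
    return [a - b for a, b in zip(u, v)]

# king - man + woman
target = add(subtract(vectors["king"], vectors["man"]), vectors["woman"])

# Look for the closest word (Euclidean distance) among those not used in the query.
candidates = {w: v for w, v in vectors.items() if w not in {"king", "man", "woman"}}
nearest = min(candidates, key=lambda word: math.dist(candidates[word], target))
print(nearest)
```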

### 4.3.2. They Have Dense Vectors
Word embeddings are dense vectors, meaning that almost all values are non-zero. Therefore, more information is given to the model, which tends to lead to better performance.

### 4.3.3. They Have a Constant Vector Size
With word embeddings, the vector size is no longer dependent on the number of documents in your corpus!

When training embedding models, the dimension of the word vector is a parameter of the model. You decide beforehand what vector size you need to represent each word. Pre-trained embeddings usually come in dimensions 50, 100, and 300.

### 4.3.4. Their Vector Representations are Absolute
The vector representations are also independent of the nature and content of the corpus you are working on.

Word embeddings are trained on gigantic datasets. Word2vec, for instance, was trained on a Google News dataset of 100 billion words, GloVe on a dataset of 6 billion words, and fastText on 16 billion tokens. As a direct consequence, these models have very large vocabularies: Word2vec has 3 million vectors, GloVe has 400,000, and fastText has 1 million.


### 4.3.5. Pre-trained models in spaCy

spaCy offers pre-trained models in several languages (including English, Polish, German, Spanish, Portuguese, French, Italian, and Dutch) and in different sizes to suit your requirements. For example, en_core_web_md includes 20k unique vectors with 300 dimensions.

In [7]:
import spacy

# Load the spacy model that you have installed
nlp = spacy.load('en_core_web_md')

# Process a sentence using the model
doc = nlp("This is some text that I am processing with Spacy")

# It's that simple - all of the vectors and words are assigned after this point
# Get the vector for 'text':
doc[3].vector

array([ 1.8153e+00, -3.0974e+00,  7.8781e+00,  1.7159e+00,  1.3492e+00,
       -4.6307e+00,  3.6709e+00, -8.5784e-02, -4.9755e+00, -8.4094e-01,
        1.0642e+01,  6.8609e+00, -9.2319e+00, -1.5872e-01, -3.8155e-01,
       -1.9255e-01,  3.3571e+00,  3.7723e+00,  1.3672e+00,  6.5571e+00,
       -6.5411e+00, -3.9489e-01, -5.2012e-01,  5.5753e-01, -3.4513e+00,
       -4.5028e+00, -1.5902e+00, -3.7582e+00, -4.8479e+00,  2.5768e+00,
       -7.2187e+00, -4.7998e+00, -1.8594e+00, -4.9777e-01, -2.4411e-01,
       -4.1268e+00, -3.4901e+00, -4.8338e+00,  4.3046e+00,  2.6234e+00,
       -4.4230e-02, -1.3608e-02, -8.8456e+00,  3.7733e+00,  2.6316e+00,
        3.4657e+00,  4.3546e+00,  1.1333e+00, -3.7832e+00, -5.7349e+00,
       -3.3476e+00, -1.0848e+00,  3.8662e+00, -1.7437e+00, -9.9700e-01,
        4.1109e+00,  1.0865e+00,  3.2447e+00,  1.9290e+00, -4.9990e+00,
        6.1250e+00,  3.9852e+00, -5.0349e+00,  2.2019e+00, -1.2268e+00,
        1.2217e+01, -1.9911e-01, -6.9239e+00, -1.4570e-01,  2.51

The vectors can be accessed directly using the .vector attribute of each processed token (word). The mean vector for the entire sentence is also calculated simply using .vector, providing a very convenient input for machine learning models based on sentences.

In [6]:
# Get the mean vector for the entire sentence (useful for sentence classification etc.)
doc.vector

array([ 0.30563405, -0.427069  ,  0.500811  , -1.6580696 , -0.94629   ,
       -0.23934703,  0.028577  ,  4.295377  , -3.8321986 ,  1.7672628 ,
        6.658099  ,  1.1359228 , -3.006612  , -0.50961006,  2.3700955 ,
        0.703461  ,  2.34051   , -0.3941861 , -1.369067  ,  0.59585994,
        1.3248701 ,  2.0000231 , -3.4256172 , -2.002854  , -2.0418718 ,
       -2.193501  , -1.819932  , -1.423071  , -2.0385802 , -0.10317302,
       -0.46430922, -0.14729555, -3.394214  , -0.04532397, -2.269256  ,
        0.17045009, -0.13494173,  0.932325  ,  5.7848444 ,  2.148113  ,
       -1.5996732 ,  2.2030263 , -0.79875904, -1.350127  , -1.3724102 ,
        2.8075051 ,  3.299128  , -2.831655  , -2.130708  ,  0.97382486,
        1.6305139 , -0.57544106,  2.093894  , -4.6533303 , -1.3734801 ,
        0.5552629 ,  2.419354  ,  2.0564501 ,  0.612147  ,  0.47596803,
        4.43481   ,  0.15596   , -1.7175634 ,  0.38855594, -0.89715827,
        2.3118348 , -2.541205  , -4.31619   ,  2.454495  ,  3.79
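
By default, the sentence vector returned by spaCy is the average of the token vectors. The toy sketch below shows that computation on three made-up 3-dimensional token vectors, without loading any model.

```python
# Hypothetical token vectors (real ones have 300 dimensions in en_core_web_md)
token_vectors = [
    [1.0, 2.0, 3.0],
    [3.0, 0.0, -1.0],
    [2.0, 4.0, 1.0],
]

# Average each dimension over the tokens.
n_tokens = len(token_vectors)
sentence_vector = [sum(column) / n_tokens for column in zip(*token_vectors)]
print(sentence_vector)
```

The sentence vector has the same size as a token vector, whatever the length of the sentence.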

Semantic similarity can also be illustrated as follows:

In [8]:
tokens = nlp('lions cat pet')

for token1 in tokens:
    for token2 in tokens:
        print(token1.text, token2.text, token1.similarity(token2))

lions lions 1.0
lions cat 0.2866518199443817
lions pet 0.2095211297273636
cat lions 0.2866518199443817
cat cat 1.0
cat pet 0.732966423034668
pet lions 0.2095211297273636
pet cat 0.732966423034668
pet pet 1.0
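
The scores printed by `similarity` are, by default, cosine similarities between the vectors. Here is a minimal standard-library sketch of that measure. The three 3-dimensional vectors are invented for the example and are not the actual spaCy vectors.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical vectors, for illustration only
toy_vectors = {
    "lions": [0.1, 0.3, 0.9],
    "cat":   [0.8, 0.6, 0.1],
    "pet":   [0.7, 0.7, 0.2],
}

for word1, vector1 in toy_vectors.items():
    for word2, vector2 in toy_vectors.items():
        print(word1, word2, round(cosine_similarity(vector1, vector2), 3))
```

Cosine similarity only depends on the direction of the vectors, not on their length, which is why a word always has a similarity of 1.0 with itself.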


### 4.3.6. Word2Vec explained

Word2Vec is a shallow, two-layer neural network that is trained to reconstruct linguistic contexts of words.
It takes as its input a large corpus of words and produces a vector space, typically of several hundred dimensions, with each unique word in the corpus being assigned a corresponding vector in the space.
Word vectors are positioned in the vector space such that words that share common contexts in the corpus are located in close proximity to one another in the space.
Word2Vec is a particularly computationally-efficient predictive model for learning word embeddings from raw text.
It comes in two flavors, the Continuous Bag-of-Words (CBOW) model and the Skip-Gram model.
Algorithmically, these models are similar.

<img src="https://israelg99.github.io/images/2017-03-23-Word2Vec-Explained/word2vec_diagrams.png" alt="drawing" width="600"/>


CBOW predicts target words (e.g. ‘mat’) from the surrounding context words (‘the cat sits on the’).
Statistically, it has the effect that CBOW smoothes over a lot of the distributional information (by treating an entire context as one observation). For the most part, this turns out to be a useful thing for smaller datasets.

Skip-gram predicts surrounding context words from the target words (inverse of CBOW).
Statistically, skip-gram treats each context-target pair as a new observation, and this tends to do better when we have larger datasets.
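
The difference between the two flavors shows up in how training examples are built from a sentence. The sketch below only generates the examples, it does not train anything. The function names and the window size of 2 are assumptions chosen for the illustration.

```python
def cbow_examples(tokens, window=2):
    """One (context words, target) example per position: the whole context is one observation."""
    examples = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(start, end) if j != i]
        examples.append((context, target))
    return examples

def skipgram_pairs(tokens, window=2):
    """One (target, context word) pair per context word: each pair is its own observation."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

tokens = "the cat sits on the mat".split()

cbow = cbow_examples(tokens)
skipgram = skipgram_pairs(tokens)

print(len(cbow), "CBOW examples, e.g.", cbow[-1])
print(len(skipgram), "skip-gram pairs, e.g.", skipgram[-1])
```

From the same six-word sentence, skip-gram yields three times as many observations as CBOW, which is consistent with the remark that it treats each context-target pair as a new observation.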

Word2Vec is a simple neural network with a single hidden layer. Like all neural networks, it has weights, and training adjusts those weights to reduce a loss function. However, Word2Vec is not going to be used for the task it was trained on. Instead, we just take its hidden-layer weights, use them as our word embeddings, and toss the rest of the model.

The rows of the hidden layer weight matrix are actually the word vectors (word embeddings) we want!

<img src="https://israelg99.github.io/images/2017-03-23-Word2Vec-Explained/word2vec_weight_matrix_lookup_table.png" alt="drawing" width="600"/>
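
The lookup-table view can be checked on a toy example: multiplying a one-hot input vector by the hidden-layer weight matrix simply selects one row of that matrix. The vocabulary and the weights below are made up, and the helper names are illustrative.

```python
vocabulary = ["the", "cat", "sits", "on", "mat"]

# Hypothetical hidden-layer weights: one row per word, 3 dimensions per word
hidden_weights = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
    [1.3, 1.4, 1.5],
]

def one_hot(index, size):
    """Vector of zeros with a single 1.0 at the given index."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def vector_matrix_product(vector, matrix):
    """Multiply a row vector by a matrix given as a list of rows."""
    return [sum(v * w for v, w in zip(vector, column)) for column in zip(*matrix)]

index = vocabulary.index("cat")
projected = vector_matrix_product(one_hot(index, len(vocabulary)), hidden_weights)

print(projected)
print(hidden_weights[index])
```

Both lines print the same values, so in practice the multiplication is replaced by a direct row lookup.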



The above explanation is a very basic one. It just gives you a high-level idea of what word embeddings are and how Word2Vec works. There’s a lot more to it. For example, to make the algorithm computationally more efficient, tricks like hierarchical softmax and negative sampling are used. GloVe and fastText are also worth looking at: GloVe extends the work of Word2Vec to capture global contextual information in a text corpus by calculating a global word-word co-occurrence matrix, while fastText works with sub-word tokenization and, as a consequence, can handle out-of-vocabulary words. Finally, Google’s Bidirectional Encoder Representations from Transformers (BERT), which rose to prominence at the end of 2018 by achieving state-of-the-art performance in many NLP tasks, can also be used to produce word embeddings.