In [2]:
import pandas as pd
import sklearn
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer


## Feature extraction

1. Traditional techniques
    - Bag of N-Grams
    - TF-IDF model
2. Deep-learning based techniques
    - Word2Vec
    - GloVe
    - FastText


### The traditional approach

Because machine learning algorithms operate on numeric data, we need a way to map
text documents to a representation in a numeric space. The traditional approach is to build
a vocabulary from the words (or a subset of the words) encountered in a corpus and to
map each word to a one-hot encoded vector whose dimension equals the size of the vocabulary.

Consider a very small example of four documents built from a tiny vocabulary of just three words:

$$
\text{doc1} = [\text{boy}, \text{mouse}, \text{boy}] \\
\text{doc2} = [\text{mouse}] \\
\text{doc3} = [\text{zoo}, \text{boy}]\\
\text{doc4} = [\text{zoo}, \text{boy}, \text{zoo}]
$$

With the vocabulary \["boy", "mouse", "zoo"\], one-hot encoding each word results in the following three vectors:

$$
\text{boy} \to
\left(
    \begin{array}{c}
        1 \\
        0\\
        0
    \end{array}
\right)
\quad
\text{mouse} \to
\left(
    \begin{array}{c}
        0 \\
        1\\
        0
    \end{array}
\right)
\quad
\text{zoo} \to
\left(
    \begin{array}{c}
        0 \\
        0\\
        1
    \end{array}
\right)
$$
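The mapping above can be sketched in a few lines of NumPy. This is only an illustration, assuming the alphabetical word order used in the text; the names `vocabulary` and `one_hot` are chosen here and are not part of any library.

```python
import numpy as np

# Vocabulary in the same (alphabetical) order as in the text
vocabulary = ["boy", "mouse", "zoo"]

# Row k of the identity matrix is the one-hot vector of the k-th word
one_hot = dict(zip(vocabulary, np.eye(len(vocabulary), dtype=int)))

for word, vector in one_hot.items():
    print(word, vector)
```

Each vector has exactly one non-zero entry, at the position of its word in the vocabulary.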

There are numerous ways to represent the documents in this three-dimensional space, depending on how we combine the word vectors into a document vector. A simple option is to sum the word vectors of each document. This is what the `CountVectorizer` does: the result is a $4 \times 3$ matrix of word counts, with one row per document and one column per vocabulary word.


In [3]:
toy_corpus = [
    "boy mouse boy",
    "mouse",
    "zoo boy",
    "zoo boy zoo"
]

In [4]:
c_vect = CountVectorizer()
term_matrix = c_vect.fit_transform(toy_corpus)

In [5]:
## Check the vocabulary learned from the corpus
c_vect.vocabulary_

{'boy': 0, 'mouse': 1, 'zoo': 2}

In [6]:
## Check the extracted features
c_vect.get_feature_names_out()

array(['boy', 'mouse', 'zoo'], dtype=object)

In [7]:
## Check the document representation
term_matrix_dense = term_matrix.toarray()
term_matrix_dense
pd.DataFrame(term_matrix_dense,
             columns=c_vect.get_feature_names_out(),
             index=[f"doc{i}" for i in range(1, len(toy_corpus) + 1)]
             )

|      | boy | mouse | zoo |
|------|-----|-------|-----|
| doc1 | 2   | 1     | 0   |
| doc2 | 0   | 1     | 0   |
| doc3 | 1   | 0     | 1   |
| doc4 | 1   | 0     | 2   |
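To make the link between one-hot vectors and the count matrix explicit, the following sketch sums the one-hot vectors of each document by hand. It assumes simple whitespace tokenisation, which happens to agree with `CountVectorizer` on this toy corpus but not in general; `document_vector` is a hypothetical helper introduced only for this illustration.

```python
import numpy as np

vocabulary = ["boy", "mouse", "zoo"]
word_index = {word: k for k, word in enumerate(vocabulary)}
identity = np.eye(len(vocabulary), dtype=int)

toy_corpus = ["boy mouse boy", "mouse", "zoo boy", "zoo boy zoo"]


def document_vector(document):
    """Sum the one-hot vectors of all words in a document (hypothetical helper)."""
    # Assumption: words are separated by whitespace and all are in the vocabulary
    return sum(identity[word_index[word]] for word in document.split())


manual_counts = np.vstack([document_vector(doc) for doc in toy_corpus])
print(manual_counts)
```

The rows of `manual_counts` should agree with the rows of the table above.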


Often another aggregation of the word vectors is preferred, where
the counts are weighted by the inverse document frequency.

**Term frequency** ($TF(i, d)$) is the number of occurrences of word $i$ in document $d$. It depends strongly
on how common a word is (e.g. "has" vs. "hexoxide" in general literature) and on the length of the document.

**Document frequency** ($DF(i)$) is the number of documents that contain word $i$.

**Inverse document frequency** ($IDF(i)$) is simply the inverse of the relative frequency of the word in the set of documents.
With $N$ documents the IDF is given by:

$$
    IDF(i) = \frac{N}{DF(i)}
$$

It is large for words that appear in only a few documents and small for words that occur in many documents.

A problem with this definition is that the IDF of rare words becomes very large for large corpora (large $N$), so the ratio is commonly replaced
by its logarithm:

$$
    IDF(i) = 1 + \log\left(\frac{N}{DF(i)}\right)
$$

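The addition of 1 in the above equation ensures that words that occur in all documents, for which the logarithm is zero, are not entirely discarded. The default IDF used in `TfidfVectorizer` (`smooth_idf=True`) additionally adds one to both the numerator and the denominator, as if one extra document containing every word had been seen, which prevents division by zero: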

$$
    IDF(i) = 1 + \log\left(\frac{N + 1}{DF(i) + 1}\right)
$$

The TF-IDF weight of word $i$ in document $d$ is then the product of the term frequency and the inverse document frequency:

$$
    \text{TF-IDF}(i, d) = TF(i, d) \times IDF(i)
$$

Let us calculate the document frequencies from the document-term count matrix we just created:

In [8]:
## First we cap the elements of the matrix at a maximum of 1,
## and then sum the matrix column-wise
doc_frequencies = np.clip(term_matrix_dense, None, 1).sum(axis=0)

pd.DataFrame(doc_frequencies, index=c_vect.get_feature_names_out())


|       | 0 |
|-------|---|
| boy   | 3 |
| mouse | 2 |
| zoo   | 2 |


From here we see that "boy" has a document frequency of 3 and that both "mouse" and "zoo" occur in two documents.
The relative frequency of "boy" is 3/4.


In [9]:
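import math

# IDF of "boy" (N = 4, DF = 3) without smoothing: 1 + log(N / DF)
print(math.log(4 / 3) + 1)
# IDF of "boy" with smoothing: 1 + log((N + 1) / (DF + 1))
print(math.log(5 / 4) + 1)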

1.2876820724517808
1.2231435513142097
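The same calculation can be done for all three words at once. The sketch below evaluates the three IDF definitions given earlier with NumPy, taking the document frequencies (3, 2, 2) and $N = 4$ from the toy corpus; the variable names are chosen here for illustration only.

```python
import numpy as np

vocabulary = ["boy", "mouse", "zoo"]
doc_freq = np.array([3, 2, 2])  # document frequencies of the toy corpus
n_docs = 4                      # number of documents N

# The three IDF definitions from the text
idf_plain = n_docs / doc_freq                           # N / DF
idf_log = 1 + np.log(n_docs / doc_freq)                 # 1 + log(N / DF)
idf_smooth = 1 + np.log((n_docs + 1) / (doc_freq + 1))  # 1 + log((N + 1) / (DF + 1))

for k, word in enumerate(vocabulary):
    print(f"{word:6s} plain={idf_plain[k]:.6f} log={idf_log[k]:.6f} smooth={idf_smooth[k]:.6f}")
```

In every variant "boy", the most widespread word, receives the smallest weight.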


In [12]:
tfidf_vect = TfidfVectorizer(smooth_idf=True, use_idf=True, norm=None)
tfidf_term_matrix = tfidf_vect.fit_transform(toy_corpus)
tfidf_vect.idf_

array([1.22314355, 1.51082562, 1.51082562])

In [11]:
pd.DataFrame(tfidf_term_matrix.toarray(),
             columns=tfidf_vect.get_feature_names_out(),
             index=[f"doc{i}" for i in range(1, len(toy_corpus) + 1)]
             )

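|      | boy      | mouse    | zoo      |
|------|----------|----------|----------|
| doc1 | 2.575364 | 1.693147 | 0.0      |
| doc2 | 0.0      | 1.693147 | 0.0      |
| doc3 | 1.287682 | 0.0      | 1.693147 |
| doc4 | 1.287682 | 0.0      | 3.386294 |

Note that these entries equal the counts multiplied by the unsmoothed IDF, $1 + \log(N / DF(i))$: for "boy" in doc1, $2 \times 1.287682 = 2.575364$. They therefore correspond to `smooth_idf=False` rather than to the smoothed `idf_` values printed by the preceding cell, which would give $2 \times 1.223144 \approx 2.4463$ for the same entry.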
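The TF-IDF product can also be reproduced by hand. The following NumPy sketch assumes raw counts as term frequency and no row normalisation (as with `norm=None`), and computes the weights for both IDF variants so that they can be compared with the table above; the variable names are illustrative.

```python
import numpy as np

# Document-term count matrix of the toy corpus (rows: doc1..doc4; columns: boy, mouse, zoo)
counts = np.array([
    [2, 1, 0],
    [0, 1, 0],
    [1, 0, 1],
    [1, 0, 2],
])

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)  # number of documents containing each word

idf_unsmoothed = 1 + np.log(n_docs / doc_freq)
idf_smoothed = 1 + np.log((n_docs + 1) / (doc_freq + 1))

# TF-IDF(i, d) = TF(i, d) * IDF(i); broadcasting scales each column by its IDF
tfidf_unsmoothed = counts * idf_unsmoothed
tfidf_smoothed = counts * idf_smoothed

print(np.round(tfidf_unsmoothed, 6))
print(np.round(tfidf_smoothed, 6))
```

By default `TfidfVectorizer` also rescales each row to unit Euclidean length (`norm='l2'`), which this sketch deliberately leaves out.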


## Example

