# Text Representation (I)

In our previous notebook, we covered text preprocessing techniques essential for cleaning and preparing raw text data. Now, we will explore how to represent this preprocessed text in a format suitable for machine learning models.
Effective text representation is fundamental for developing robust NLP applications. In this notebook, we will explore:
- Bag of Words (BoW): A straightforward method using word counts for representation.
- TF-IDF (Term Frequency-Inverse Document Frequency): A technique that evaluates words based on their importance in documents.

Later on, we will discuss more advanced techniques such as word embeddings.
By the end of this session, you'll grasp these essential text representation techniques and their practical implementation. We will utilize the scikit-learn library to implement and apply these text representation techniques (https://scikit-learn.org/stable/).

## Bag of Words (BoW)

The Bag of Words (BoW) model is one of the simplest text representation techniques in NLP. It transforms text into fixed-length vectors by counting the occurrence of each word in the text. This approach ignores grammar and word order, focusing solely on word frequency. Let's walk through the implementation of the BoW model using the `scikit-learn` library and, more specifically, the `CountVectorizer` object.

In [1]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

We will consider a couple of sentences as examples.


In [2]:
sample_sentences = [
    "My bunny Giovanna eats fennel and celery.",
    "Celery is a crunchy vegetable."
]

`CountVectorizer` is an object in `scikit-learn` that converts a collection of text documents to a matrix of token counts. A token for CountVectorizer is a single unit of text, typically a word, that is counted in the text documents.

In [3]:
vectorizer = CountVectorizer()

We fit the `CountVectorizer` to the sample sentences and transform these into a BoW representation.

In [4]:
X = vectorizer.fit_transform(sample_sentences)

`X` is a sparse matrix containing the token counts for each document generated by `CountVectorizer`. We use `X.toarray()` to convert this sparse matrix into a dense array for easier manipulation and visualization.
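To see why a sparse format is convenient, here is a minimal sketch in plain Python. The toy matrix and the dictionary-based storage are assumptions made for illustration only; scikit-learn returns a SciPy sparse matrix, whose internal layout is different, but the idea of storing only the non-zero entries is the same.

```python
# Hypothetical toy count matrix: 3 documents, 6 vocabulary terms.
dense = [
    [2, 0, 0, 1, 0, 0],
    [0, 0, 3, 0, 0, 0],
    [0, 1, 0, 0, 0, 1],
]

# Coordinate-style sparse form: keep only (row, column) -> value for non-zeros.
sparse = {
    (i, j): value
    for i, row in enumerate(dense)
    for j, value in enumerate(row)
    if value != 0
}

def to_dense(sparse, n_rows, n_cols):
    """Rebuild the dense matrix, filling the missing entries with zeros."""
    return [[sparse.get((i, j), 0) for j in range(n_cols)] for i in range(n_rows)]

print(len(sparse), "stored values instead of", len(dense) * len(dense[0]))
print(to_dense(sparse, 3, 6) == dense)
```

With a realistic vocabulary of thousands of words, most entries of each row are zero, so the saving becomes much larger than in this toy case.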

In [5]:
X_dense = X.toarray()
feature_names = vectorizer.get_feature_names_out()
df_bow = pd.DataFrame(X_dense, columns=feature_names)

Let's see the result.

In [6]:
print("Result: \n\n", df_bow)

Result: 

    and  bunny  celery  crunchy  eats  fennel  giovanna  is  my  vegetable
0    1      1       1        0     1       1         1   0   1          0
1    0      0       1        1     0       0         0   1   0          1


The result is a dense matrix where each row represents a sentence, and each column corresponds to a lowercased token (word). The values indicate the frequency of each word in the respective sentence. For example, in the first sentence, "and", "bunny", "celery", "eats", "fennel", "giovanna", and "my" each appear once, while "crunchy", "is", and "vegetable" do not appear. In the second sentence, "celery", "crunchy", "is", and "vegetable" each appear once, while the other words do not. Note that the word "a" from the second sentence has no column: by default, `CountVectorizer` only keeps tokens of at least two characters, and it also discards punctuation.
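The counting itself can be approximated in a few lines of plain Python. This is only a sketch: it assumes the default behaviour of `CountVectorizer` (lowercasing, and the token pattern `\b\w\w+\b`, which keeps tokens of two or more word characters), and the names `tokenize`, `vocabulary`, and `bow` are chosen here for illustration.

```python
import re
from collections import Counter

sentences = [
    "My bunny Giovanna eats fennel and celery.",
    "Celery is a crunchy vegetable.",
]

# Assumption: mimic CountVectorizer's defaults by lowercasing the text and
# keeping only tokens made of two or more word characters.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")

def tokenize(text):
    return TOKEN_PATTERN.findall(text.lower())

# One Counter per sentence, then a sorted vocabulary shared by all sentences.
counts = [Counter(tokenize(s)) for s in sentences]
vocabulary = sorted(set().union(*counts))

# Each row holds the count of every vocabulary term in one sentence.
bow = [[c[term] for term in vocabulary] for c in counts]

print(vocabulary)
for row in bow:
    print(row)
```

The printed rows should match the table above, which makes it easier to see that BoW is nothing more than tokenizing, building a vocabulary, and counting.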

**Bonus**: We can combine the BoW approach with the lemmatization technique learned in the previous notebook. What benefits could be obtained? Let's consider these two sentences.

In [7]:
sample_sentences = [
    "Giovanna loves to hop",
    "Giovanna is hopping"
]

Without lemmatization, everything works as we have already seen: "hop" and "hopping" are treated as two unrelated tokens.

In [8]:
X = vectorizer.fit_transform(sample_sentences)
X_dense = X.toarray()
feature_names = vectorizer.get_feature_names_out()
df_bow = pd.DataFrame(X_dense, columns=feature_names)
print("Result: \n\n", df_bow)

Result: 

    giovanna  hop  hopping  is  loves  to
0         1    1        0   0      1   1
1         1    0        1   1      0   0


To combine BoW with lemmatization, let's import `nltk` and download the resources it needs.

In [9]:
import nltk
from nltk.corpus import wordnet
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk import download
download('punkt')
download('wordnet')
download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

We now define the `LemmaTokenizer` class: it tokenizes text and lemmatizes each token, reducing words to their base forms to improve text representation in Bag of Words (BoW). The `LemmaTokenizer` class inherits from `BaseEstimator` and `TransformerMixin`, which are utility classes in scikit-learn that facilitate the creation of custom transformers and ensure compatibility with scikit-learn's pipeline framework. Strictly speaking, `CountVectorizer` only requires its tokenizer to be callable, which the `__call__` method provides. (We assume the reader already knows how classes work in Python; explaining them is not the focus of this notebook.)

In [10]:
from sklearn.base import BaseEstimator, TransformerMixin

class LemmaTokenizer(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.wnl = WordNetLemmatizer()

    def get_wordnet_pos(self, word):
        """Map POS tag to first character lemmatize() accepts"""
        tag = nltk.pos_tag([word])[0][1][0].upper()
        tag_dict = {"J": wordnet.ADJ,
                    "N": wordnet.NOUN,
                    "V": wordnet.VERB,
                    "R": wordnet.ADV}
        return tag_dict.get(tag, wordnet.NOUN)

    def __call__(self, doc):
        return [self.wnl.lemmatize(t, self.get_wordnet_pos(t)) for t in word_tokenize(doc)]

We can use `LemmaTokenizer` with `CountVectorizer` to preprocess text with lemmatization before creating the BoW matrix.

In [11]:
vectorizer = CountVectorizer(tokenizer=LemmaTokenizer())
X = vectorizer.fit_transform(sample_sentences)
X_dense = X.toarray()
feature_names = vectorizer.get_feature_names_out()



Let's see the result.

In [12]:
df_bow = pd.DataFrame(X_dense, columns=feature_names)
print("Result: \n\n", df_bow)

Result: 

    be  giovanna  hop  love  to
0   0         1    1     1   1
1   1         1    1     0   0


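Note that with lemmatization, inflected words are reduced to their base forms: "hopping" becomes "hop", "loves" becomes "love", and "is" becomes "be". As a result, "hop" and "hopping" now share a single column, and the vocabulary shrinks from six features to five. Merging such variants reduces feature dimensionality and makes the representation more consistent.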
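The effect on the vocabulary can be reproduced without NLTK. In the sketch below, the `LEMMAS` dictionary is a hypothetical hand-written lookup table that stands in for `WordNetLemmatizer`; it only covers the words of our two sentences.

```python
# Hypothetical hand-written lemma table, standing in for WordNetLemmatizer.
LEMMAS = {"loves": "love", "hopping": "hop", "is": "be"}

sentences = ["giovanna loves to hop", "giovanna is hopping"]

def vocabulary(docs, lemmatize=False):
    """Collect the sorted set of tokens, optionally mapped to their lemmas."""
    terms = set()
    for doc in docs:
        for token in doc.split():
            terms.add(LEMMAS.get(token, token) if lemmatize else token)
    return sorted(terms)

raw_vocab = vocabulary(sentences)
lemma_vocab = vocabulary(sentences, lemmatize=True)

print(len(raw_vocab), raw_vocab)
print(len(lemma_vocab), lemma_vocab)
```

On a corpus this small the saving is a single column, but on real data, where each verb or noun may occur in several inflected forms, the reduction is usually more noticeable.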

## TF-IDF

After discussing the BoW model, let's now explore TF-IDF (Term Frequency-Inverse Document Frequency). TF-IDF addresses one limitation of BoW: raw counts treat every word as equally informative, whereas TF-IDF weights each word according to how rare it is across the corpus. Other limitations, such as the disregard for word order and the lack of semantic understanding, still remain. In other words, while BoW counts how often words appear in a document, TF-IDF also factors in how rare those words are in general.

The formula for TF-IDF combines two components: Term Frequency (TF) and Inverse Document Frequency (IDF). TF measures how frequently a term appears in a document, normalized by the total number of terms in that document. IDF, on the other hand, measures how rare or common a term is across all documents in the corpus. By multiplying TF and IDF together, TF-IDF assigns higher weights to terms that are frequent in the document but rare across the corpus, helping to identify words that are uniquely significant to a specific document.
In mathematical terms:

- $TF(t, s) = \frac{\mbox{number of times term t appears in sentence s}}{\mbox{total number of terms in s}}$

- $IDF(t) = \log \bigg{(}\frac{\mbox{total number of sentences}}{\mbox{1 + number of sentences containing t}}\bigg{)}$

- $\mbox{TF-IDF}(t, s) = TF(t, s) \times IDF(t)$.

In these formulas, each sentence $s$ plays the role of a document, and the set of all sentences is the corpus.
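As a worked example, the formulas above can be translated directly into Python. This is a minimal sketch: the function names `tf`, `idf`, and `tf_idf` are chosen here for illustration, the natural logarithm is assumed, and the sentences are tokenized by hand to keep the focus on the arithmetic.

```python
import math

# Tokens are written out by hand to keep the sketch focused on the formulas.
docs = [
    ["giovanna", "loves", "to", "hop"],
    ["giovanna", "is", "hopping"],
]

def tf(term, doc):
    """Occurrences of the term divided by the number of terms in the sentence."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """log(total sentences / (1 + sentences containing the term))."""
    containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / (1 + containing))

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

for term in ["giovanna", "hop"]:
    print(term, round(tf_idf(term, docs[0], docs), 4))
```

On this two-sentence corpus, "hop" gets a weight of 0 and "giovanna" a slightly negative one, because the IDF term is $\log(2/2)$ and $\log(2/3)$ respectively. The ranking is still the expected one (the word shared by both sentences scores lower), but the example shows why very small corpora are where adjusted variants of the IDF formula become useful.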

Again, we can implement TF-IDF using the `sklearn` library. Specifically, we'll utilize the `TfidfVectorizer` object.

It's important to note that `TfidfVectorizer` does not use exactly these formulas. With its default settings, it takes the raw term count as TF, computes a smoothed IDF, $\ln\frac{1 + n}{1 + df(t)} + 1$ (where $n$ is the number of documents and $df(t)$ the number of documents containing $t$), so that no term receives a zero or negative weight, and then rescales each row to unit Euclidean (L2) length. For more details, refer to the documentation. However, the basic idea is the same.
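To make the difference concrete, the sketch below recomputes the weights by hand in plain Python. It assumes scikit-learn's documented defaults (raw counts as TF, the smoothed IDF $\ln\frac{1 + n}{1 + df(t)} + 1$, and L2 normalization of each row); the helper names `tokenize` and `tfidf_row` are chosen here for illustration and are not part of the library.

```python
import math
import re
from collections import Counter

sentences = ["Giovanna loves to hop.", "Giovanna is hopping."]

def tokenize(text):
    # Assumption: lowercase and keep tokens of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

docs = [Counter(tokenize(s)) for s in sentences]
vocabulary = sorted(set().union(*docs))
n_docs = len(docs)

# Smoothed IDF: ln((1 + n) / (1 + df)) + 1, always strictly positive.
idf = {}
for term in vocabulary:
    df = sum(1 for doc in docs if term in doc)
    idf[term] = math.log((1 + n_docs) / (1 + df)) + 1

def tfidf_row(doc):
    """Raw count times IDF for every term, then divide by the row's L2 norm."""
    weights = [doc[term] * idf[term] for term in vocabulary]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]

rows = [tfidf_row(doc) for doc in docs]

print(vocabulary)
for row in rows:
    print([round(w, 6) for w in row])
```

The printed values can be compared with the output of `TfidfVectorizer` on the same two sentences; if the assumptions above hold, they should agree up to rounding.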

In [13]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer()

Let's consider the previous sample sentences.

In [14]:
sample_sentences = [
    "Giovanna loves to hop.",
    "Giovanna is hopping."
]

`TfidfVectorizer`, like `CountVectorizer`, is an object in `scikit-learn` that converts a collection of text documents to a matrix. However, in this case, the weight assigned to each token is determined by the TF-IDF metric rather than its count.

In [15]:
X_tfidf = tfidf_vectorizer.fit_transform(sample_sentences)

Let's display the result.

In [16]:
X_tfidf_dense = X_tfidf.toarray()
feature_names = tfidf_vectorizer.get_feature_names_out()
df_tfidf = pd.DataFrame(X_tfidf_dense, columns=feature_names)
print("Result: \n\n", df_tfidf)

Result: 

    giovanna       hop   hopping        is     loves        to
0  0.379978  0.534046  0.000000  0.000000  0.534046  0.534046
1  0.449436  0.000000  0.631667  0.631667  0.000000  0.000000


Is the result as expected? Let's consider a couple of examples:

- The term "giovanna" has less weight in sentence 0 than in sentence 1. This was expected: sentence 0 contains more terms, so once the row is normalized for length, each of its terms receives a smaller share of the weight. This plays the same role as the lower TF in the formulas above.

- The term "hop" has more weight in sentence 0 than "giovanna". This was expected because "giovanna" is also present in sentence 1, resulting in a lower IDF.

We can further enhance the TF-IDF approach by combining it with lemmatization or other preprocessing techniques already discussed, to represent words in a more concise and meaningful manner.

## Conclusion

In this notebook, we have explored fundamental text representation techniques, starting with the simple yet effective Bag of Words (BoW) model and progressing to the more sophisticated TF-IDF approach. These methods have provided us with foundational tools for transforming text into numerical formats, which are essential for building initial models in Natural Language Processing (NLP).

Looking ahead, we plan to explore more complex and powerful representation techniques, including word embeddings, which offer richer and more nuanced interpretations of text data. Now that we have established methods for representing text numerically, we are well-prepared to begin exploring simple NLP models, paving the way for more advanced analytical applications.