<a href="https://colab.research.google.com/github/astrapi69/DroidBallet/blob/master/NLP_D1_2_LC6_Vectorized_Text_Representations.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<center><a target="_blank" href="https://learning.constructor.org/"><img src="https://drive.google.com/uc?id=1RNy-ds7KWXFs7YheGo9OQwO3OnpvRSU1" width="200" style="background:none; border:none; box-shadow:none;" /></a> </center>

_____

<center>Constructor Academy, 2024</center>

# Vectorized Text Representations

## Exploring Traditional Statistical Models

Feature engineering is often called the secret sauce for building better-performing machine learning models.

Just one excellent feature could be your ticket to winning a Kaggle challenge or getting the best model to be deployed in the enterprise!

Feature engineering is even more important for unstructured, textual data because we need to convert free-flowing text into numeric representations which can then be understood by machine learning algorithms.

Here we will explore the following feature engineering techniques:

+ Bag of Words Model (TF)
+ Bag of N-grams Model
+ TF-IDF Model
+ Document Similarity
+ Topic Model Features



## Prepare Corpus

Let’s now take a sample corpus of documents on which we will run most of our analyses in this article. A corpus is typically a collection of text documents usually belonging to one or more subjects or domains.

In [None]:
import numpy as np
import pandas as pd

pd.options.display.max_colwidth = 200

In [None]:
corpus = ['The sky is blue and beautiful.',
          'Love this blue and beautiful sky!',
          'The quick brown fox jumps over the lazy dog.',
          "A king's breakfast has sausages, ham, bacon, eggs, toast and beans",
          'I love green eggs, ham, sausages and bacon!',
          'The brown fox is quick and the blue dog is lazy!',
          'The sky is very blue and the sky is very beautiful today',
          'The dog is lazy but the brown fox is quick!'
]
labels = ['weather', 'weather', 'animals', 'food', 'food', 'animals', 'weather', 'animals']

In [None]:
corpus = np.array(corpus)
corpus_df = pd.DataFrame({'Document': corpus,
                          'Category': labels})
corpus_df = corpus_df[['Document', 'Category']]
corpus_df

Unnamed: 0,Document,Category
0,The sky is blue and beautiful.,weather
1,Love this blue and beautiful sky!,weather
2,The quick brown fox jumps over the lazy dog.,animals
3,"A king's breakfast has sausages, ham, bacon, eggs, toast and beans",food
4,"I love green eggs, ham, sausages and bacon!",food
5,The brown fox is quick and the blue dog is lazy!,animals
6,The sky is very blue and the sky is very beautiful today,weather
7,The dog is lazy but the brown fox is quick!,animals


## Simple Preprocessing

Since the focus of this unit is on feature engineering, we will build a simple text pre-processor which focuses on removing special characters, extra whitespaces, digits, stopwords and lower casing the text corpus.

In [None]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
import nltk
import re

In [None]:
stop_words = nltk.corpus.stopwords.words('english')

def normalize_document(doc):
    # lower case and remove special characters\whitespaces
    doc = re.sub(r'[^a-zA-Z\s]', '', doc, flags=re.I|re.A) # [^a-zA-Z\s] => remove any digits, special characters, symbols etc.
    doc = doc.lower()
    doc = doc.strip()
    # tokenize document
    tokens = nltk.word_tokenize(doc)
    # filter stopwords out of document
    filtered_tokens = [token for token in tokens if token not in stop_words]
    # re-create document from filtered tokens
    doc = ' '.join(filtered_tokens)
    return doc

In [None]:
normalize_corpus = np.vectorize(normalize_document) # this is same as running the function for all documents in a for loop

norm_corpus = normalize_corpus(corpus)
norm_corpus

array(['sky blue beautiful', 'love blue beautiful sky',
       'quick brown fox jumps lazy dog',
       'kings breakfast sausages ham bacon eggs toast beans',
       'love green eggs ham sausages bacon',
       'brown fox quick blue dog lazy', 'sky blue sky beautiful today',
       'dog lazy brown fox quick'], dtype='<U51')

In [None]:
pd.DataFrame(norm_corpus, columns=['Document'])

Unnamed: 0,Document
0,sky blue beautiful
1,love blue beautiful sky
2,quick brown fox jumps lazy dog
3,kings breakfast sausages ham bacon eggs toast beans
4,love green eggs ham sausages bacon
5,brown fox quick blue dog lazy
6,sky blue sky beautiful today
7,dog lazy brown fox quick



## Bag of Words Model

This is perhaps the simplest vector space representational model for unstructured text.

A vector space model is simply a mathematical model to represent unstructured text (or any other data) as numeric vectors, such that each dimension of the vector is a specific feature/attribute.

The bag of words model represents each text document as a numeric vector where each dimension is a specific word from the corpus and the value could be its frequency in the document, occurrence (denoted by 1 or 0) or even weighted values.

The model’s name is such because each document is represented literally as a ‘bag’ of its own words, disregarding word orders, sequences and grammar.
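To make the idea concrete, here is a minimal hand-rolled sketch of a bag of words using only the standard library. The helper name `bag_of_words` and the two-document mini corpus are illustrative assumptions, not part of scikit-learn; the sketch shows both the frequency and the 0/1 occurrence variants mentioned above.

```python
from collections import Counter

# Two normalized documents taken from the corpus above
docs = ['sky blue beautiful', 'sky blue sky beautiful today']

# Vocabulary: every distinct word, sorted alphabetically
vocabulary = sorted({word for doc in docs for word in doc.split()})

def bag_of_words(doc, vocabulary, binary=False):
    """Return one count (or 0/1 occurrence flag) per vocabulary word."""
    counts = Counter(doc.split())
    if binary:
        return [int(counts[word] > 0) for word in vocabulary]
    return [counts[word] for word in vocabulary]

count_vectors = [bag_of_words(doc, vocabulary) for doc in docs]
binary_vectors = [bag_of_words(doc, vocabulary, binary=True) for doc in docs]

print(vocabulary)      # ['beautiful', 'blue', 'sky', 'today']
print(count_vectors)   # [[1, 1, 1, 0], [1, 1, 2, 1]]
print(binary_vectors)  # [[1, 1, 1, 0], [1, 1, 1, 1]]
```

Word order plays no role here: shuffling the words of a document leaves its vector unchanged.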

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
# get bag of words features in sparse format
# check docs to play around with min and max doc frequencies e.g min_df=0.05, max_df = 0.95 ; min_df = 3, max_df=120
cv = CountVectorizer()
cv_matrix = cv.fit_transform(norm_corpus)

In [None]:
cv_matrix

<8x20 sparse matrix of type '<class 'numpy.int64'>'
	with 42 stored elements in Compressed Sparse Row format>

In [None]:
cv.get_feature_names_out()

array(['bacon', 'beans', 'beautiful', 'blue', 'breakfast', 'brown', 'dog',
       'eggs', 'fox', 'green', 'ham', 'jumps', 'kings', 'lazy', 'love',
       'quick', 'sausages', 'sky', 'toast', 'today'], dtype=object)

In [None]:
# view non-zero feature positions in the sparse matrix
print(cv_matrix)

  (0, 17)	1
  (0, 3)	1
  (0, 2)	1
  (1, 17)	1
  (1, 3)	1
  (1, 2)	1
  (1, 14)	1
  (2, 15)	1
  (2, 5)	1
  (2, 8)	1
  (2, 11)	1
  (2, 13)	1
  (2, 6)	1
  (3, 12)	1
  (3, 4)	1
  (3, 16)	1
  (3, 10)	1
  (3, 0)	1
  (3, 7)	1
  (3, 18)	1
  (3, 1)	1
  (4, 14)	1
  (4, 16)	1
  (4, 10)	1
  (4, 0)	1
  (4, 7)	1
  (4, 9)	1
  (5, 3)	1
  (5, 15)	1
  (5, 5)	1
  (5, 8)	1
  (5, 13)	1
  (5, 6)	1
  (6, 17)	2
  (6, 3)	1
  (6, 2)	1
  (6, 19)	1
  (7, 15)	1
  (7, 5)	1
  (7, 8)	1
  (7, 13)	1
  (7, 6)	1


The feature matrix is traditionally stored as a sparse matrix because the number of features grows rapidly with the size of the corpus (each distinct word becomes a feature), while any single document contains only a few of them.

The preceding output lists a count for each (x, y) pair with a non-zero value.

Here x represents a document and y represents a specific word/feature, and the value is the number of times y occurs in x.

We can leverage the following code to view the output in a dense matrix representation.

In [None]:
cv_matrix = cv_matrix.toarray() # not recommended when you have a lot of features or documents; good for readability
cv_matrix

array([[0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
       [0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0],
       [0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0],
       [1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0],
       [1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0],
       [0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0],
       [0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 1],
       [0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0]])
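As a rough illustration of why the sparse format pays off, the sketch below converts the dense array shown above back into a coordinate-style dictionary that stores only the non-zero cells. The dictionary layout is an assumption made for readability; SciPy's Compressed Sparse Row format organizes the same information differently.

```python
# Dense document-term counts copied from the output above (8 documents x 20 words)
dense = [
    [0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
    [0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0],
    [1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 1],
    [0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0],
]

# Coordinate form: keep only the non-zero cells as {(row, column): count}
sparse = {(i, j): value
          for i, row in enumerate(dense)
          for j, value in enumerate(row)
          if value != 0}

total_cells = len(dense) * len(dense[0])
print(len(sparse), 'of', total_cells, 'cells are non-zero')  # 42 of 160 cells are non-zero
print(sparse[(6, 17)])  # 2 -> 'sky' (column 17) occurs twice in document 6
```

Even on this tiny corpus roughly three quarters of the cells are zero, and the share of zeros only grows as the vocabulary gets larger.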

In [None]:
# get all unique words in the corpus
vocab = cv.get_feature_names_out()
# show document feature vectors
pd.DataFrame(cv_matrix, columns=vocab)

Unnamed: 0,bacon,beans,beautiful,blue,breakfast,brown,dog,eggs,fox,green,ham,jumps,kings,lazy,love,quick,sausages,sky,toast,today
0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
1,0,0,1,1,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0
2,0,0,0,0,0,1,1,0,1,0,0,1,0,1,0,1,0,0,0,0
3,1,1,0,0,1,0,0,1,0,0,1,0,1,0,0,0,1,0,1,0
4,1,0,0,0,0,0,0,1,0,1,1,0,0,0,1,0,1,0,0,0
5,0,0,0,1,0,1,1,0,1,0,0,0,0,1,0,1,0,0,0,0
6,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,1
7,0,0,0,0,0,1,1,0,1,0,0,0,0,1,0,1,0,0,0,0


In [None]:
# min_df = 0.05 -> remove words occurring in less than 5% of documents; removes very rare words
# max_df = 0.95 -> remove words occurring in more than 95% of documents; removes very common words e.g. custom stopwords
cv1 = CountVectorizer(min_df=0.05, max_df=0.95)
cv1_matrix = cv1.fit_transform(norm_corpus)
cv1_matrix = cv1_matrix.toarray()
vocab = cv1.get_feature_names_out()
# show document feature vectors - no change here, since no word falls outside these thresholds
pd.DataFrame(cv1_matrix, columns=vocab)

Unnamed: 0,bacon,beans,beautiful,blue,breakfast,brown,dog,eggs,fox,green,ham,jumps,kings,lazy,love,quick,sausages,sky,toast,today
0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
1,0,0,1,1,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0
2,0,0,0,0,0,1,1,0,1,0,0,1,0,1,0,1,0,0,0,0
3,1,1,0,0,1,0,0,1,0,0,1,0,1,0,0,0,1,0,1,0
4,1,0,0,0,0,0,0,1,0,1,1,0,0,0,1,0,1,0,0,0
5,0,0,0,1,0,1,1,0,1,0,0,0,0,1,0,1,0,0,0,0
6,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,1
7,0,0,0,0,0,1,1,0,1,0,0,0,0,1,0,1,0,0,0,0


In [None]:
cv2 = CountVectorizer(min_df=0.15, max_df=0.9)  # min_df=0.15, max_df = 0.9
cv2_matrix = cv2.fit_transform(norm_corpus)
cv2_matrix = cv2_matrix.toarray()
vocab = cv2.get_feature_names_out()
# show document feature vectors - many features get dropped - feature selection
pd.DataFrame(cv2_matrix, columns=vocab)

Unnamed: 0,bacon,beautiful,blue,brown,dog,eggs,fox,ham,lazy,love,quick,sausages,sky
0,0,1,1,0,0,0,0,0,0,0,0,0,1
1,0,1,1,0,0,0,0,0,0,1,0,0,1
2,0,0,0,1,1,0,1,0,1,0,1,0,0
3,1,0,0,0,0,1,0,1,0,0,0,1,0
4,1,0,0,0,0,1,0,1,0,1,0,1,0
5,0,0,1,1,1,0,1,0,1,0,1,0,0
6,0,1,1,0,0,0,0,0,0,0,0,0,2
7,0,0,0,1,1,0,1,0,1,0,1,0,0
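The effect of `min_df=0.15` and `max_df=0.9` can be reproduced by hand. The sketch below is a simplified approximation of the filtering, assuming the proportions are multiplied by the number of documents and compared against each word's document frequency; `filter_vocabulary` is a hypothetical helper, not a scikit-learn function.

```python
from collections import Counter

# Normalized corpus as shown earlier
norm_docs = [
    'sky blue beautiful', 'love blue beautiful sky',
    'quick brown fox jumps lazy dog',
    'kings breakfast sausages ham bacon eggs toast beans',
    'love green eggs ham sausages bacon',
    'brown fox quick blue dog lazy', 'sky blue sky beautiful today',
    'dog lazy brown fox quick',
]

# Document frequency: number of documents that contain each word at least once
doc_freq = Counter(word for doc in norm_docs for word in set(doc.split()))

def filter_vocabulary(doc_freq, n_docs, min_df, max_df):
    """Keep words whose document frequency lies within the given proportions."""
    low, high = min_df * n_docs, max_df * n_docs
    return sorted(word for word, df in doc_freq.items() if low <= df <= high)

kept = filter_vocabulary(doc_freq, len(norm_docs), min_df=0.15, max_df=0.9)
print(len(kept))  # 13
print(kept)
```

With 8 documents, `min_df=0.15` corresponds to a threshold of 1.2 documents, so every word that appears in only one document (such as 'jumps' or 'today') is dropped, which leaves the 13 columns shown above.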


You can clearly see that each column or dimension in the feature vectors represents a word from the corpus and each row represents one of our documents.

The value in any cell represents the number of times that word (represented by the column) occurs in the specific document (represented by the row).

## Bag of N-Grams Model

A word is just a single token, often known as a unigram or 1-gram. We already know that the Bag of Words model doesn’t consider order of words. But what if we also wanted to take into account phrases or collection of words which occur in a sequence?

N-grams help us achieve that. An N-gram is basically a collection of word tokens from a text document such that these tokens are contiguous and occur in a sequence.

Bi-grams indicate n-grams of order 2 (two words), Tri-grams indicate n-grams of order 3 (three words), and so on.

The Bag of N-Grams model is hence just an extension of the Bag of Words model so we can also leverage N-gram based features. The following example builds both unigram and bi-gram based features for each document feature vector.

In [None]:
# you can set the n-gram range to 1,2 to get unigrams as well as bigrams
bv = CountVectorizer(ngram_range=(1,2))
bv_matrix = bv.fit_transform(norm_corpus)

bv_matrix = bv_matrix.toarray()
vocab = bv.get_feature_names_out()
pd.DataFrame(bv_matrix, columns=vocab)

Unnamed: 0,bacon,bacon eggs,beans,beautiful,beautiful sky,beautiful today,blue,blue beautiful,blue dog,blue sky,...,quick brown,sausages,sausages bacon,sausages ham,sky,sky beautiful,sky blue,toast,toast beans,today
0,0,0,0,1,0,0,1,1,0,0,...,0,0,0,0,1,0,1,0,0,0
1,0,0,0,1,1,0,1,1,0,0,...,0,0,0,0,1,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
3,1,1,1,0,0,0,0,0,0,0,...,0,1,0,1,0,0,0,1,1,0
4,1,0,0,0,0,0,0,0,0,0,...,0,1,1,0,0,0,0,0,0,0
5,0,0,0,0,0,0,1,0,1,0,...,0,0,0,0,0,0,0,0,0,0
6,0,0,0,1,0,1,1,0,0,1,...,0,0,0,0,2,1,1,0,0,1
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [None]:
# 1 and 2 grams for 'sky blue beautiful'

# 1-grams - 'sky', 'blue', 'beautiful'
# 2-grams - 'sky blue', 'blue beautiful'
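The breakdown in the comment above can be reproduced with a small helper. This is a minimal sketch that assumes whitespace tokenization; `word_ngrams` is an illustrative name rather than a library function.

```python
def word_ngrams(text, n):
    """Return all contiguous n-word sequences of a whitespace-tokenized text."""
    tokens = text.split()
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

doc = 'sky blue beautiful'
print(word_ngrams(doc, 1))  # ['sky', 'blue', 'beautiful']
print(word_ngrams(doc, 2))  # ['sky blue', 'blue beautiful']
print(word_ngrams(doc, 3))  # ['sky blue beautiful']
```

A document with fewer than n tokens simply yields no n-grams of that order.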

## TF-IDF Model

There are some potential problems which might arise with the Bag of Words model when it is used on large corpora. Since the feature vectors are based on absolute term frequencies, there might be some terms which occur frequently across all documents and these may tend to overshadow other terms in the feature set.

The TF-IDF model tries to combat this issue by using a scaling or normalizing factor in its computation. TF-IDF stands for Term Frequency-Inverse Document Frequency, which uses a combination of two metrics in its computation, namely: term frequency (tf) and inverse document frequency (idf).

This technique was developed for ranking results for queries in search engines and now it is an indispensable model in the world of information retrieval and NLP.

Mathematically, we can define TF-IDF as $$tfidf = tf \times idf$$

which can be expanded further to be represented as follows.

$$tfidf(w, D) = tf(w, D) \times idf(w, D) = tf(w, D) \times \log\left(\frac{C}{df(w)}\right)$$

Here, $tfidf(w, D)$ is the TF-IDF score for word $w$ in document $D$.

- The term $tf(w, D)$ represents the term frequency of the word $w$ in document $D$, which can be obtained from the Bag of Words model.
- The term $idf(w, D)$ is the inverse document frequency for the word $w$, computed as the log transform of the total number of documents in the corpus, $C$, divided by the document frequency $df(w)$, which is the number of documents in the corpus in which the word $w$ occurs (not the total number of times $w$ occurs across all documents).

In most implementations, each document's TF-IDF vector is then normalized by dividing it by its L2 norm, also known as the Euclidean norm (the square root of the sum of the squares of the term weights in that document).

There are multiple variants of this model but they all end up giving quite similar results. Let’s apply this on our corpus now!

### Using TF-IDF Vectorizer

The TfidfVectorizer by scikit-learn enables us to directly compute the TF-IDF vectors by taking the raw documents themselves as input and internally computing the term frequencies as well as the inverse document frequencies.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
tv = TfidfVectorizer(use_idf=True)
tv_matrix = tv.fit_transform(norm_corpus)
tv_matrix = tv_matrix.toarray()

vocab = tv.get_feature_names_out()
pd.DataFrame(np.round(tv_matrix, 2), columns=vocab)

Unnamed: 0,bacon,beans,beautiful,blue,breakfast,brown,dog,eggs,fox,green,ham,jumps,kings,lazy,love,quick,sausages,sky,toast,today
0,0.0,0.0,0.6,0.53,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.6,0.0,0.0
1,0.0,0.0,0.49,0.43,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.57,0.0,0.0,0.49,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.38,0.38,0.0,0.38,0.0,0.0,0.53,0.0,0.38,0.0,0.38,0.0,0.0,0.0,0.0
3,0.32,0.38,0.0,0.0,0.38,0.0,0.0,0.32,0.0,0.0,0.32,0.0,0.38,0.0,0.0,0.0,0.32,0.0,0.38,0.0
4,0.39,0.0,0.0,0.0,0.0,0.0,0.0,0.39,0.0,0.47,0.39,0.0,0.0,0.0,0.39,0.0,0.39,0.0,0.0,0.0
5,0.0,0.0,0.0,0.37,0.0,0.42,0.42,0.0,0.42,0.0,0.0,0.0,0.0,0.42,0.0,0.42,0.0,0.0,0.0,0.0
6,0.0,0.0,0.36,0.32,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.72,0.0,0.5
7,0.0,0.0,0.0,0.0,0.0,0.45,0.45,0.0,0.45,0.0,0.0,0.0,0.0,0.45,0.0,0.45,0.0,0.0,0.0,0.0
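The numbers in this table can be checked by hand. The sketch below recomputes one row using only the standard library, assuming scikit-learn's default settings: a smoothed idf of the form $\ln\frac{1 + n}{1 + df} + 1$ followed by L2 normalization of each document vector. The helper names `smoothed_idf` and `tfidf_vector` are illustrative.

```python
import math
from collections import Counter

# Normalized corpus as shown earlier
norm_docs = [
    'sky blue beautiful', 'love blue beautiful sky',
    'quick brown fox jumps lazy dog',
    'kings breakfast sausages ham bacon eggs toast beans',
    'love green eggs ham sausages bacon',
    'brown fox quick blue dog lazy', 'sky blue sky beautiful today',
    'dog lazy brown fox quick',
]
n_docs = len(norm_docs)

# Document frequency: number of documents containing each word
doc_freq = Counter(word for doc in norm_docs for word in set(doc.split()))

def smoothed_idf(word):
    # Assumed variant (scikit-learn default): ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + doc_freq[word])) + 1

def tfidf_vector(doc):
    """Return {word: L2-normalized tf-idf weight} for one document."""
    term_freq = Counter(doc.split())
    raw = {word: tf * smoothed_idf(word) for word, tf in term_freq.items()}
    norm = math.sqrt(sum(weight ** 2 for weight in raw.values()))
    return {word: weight / norm for word, weight in raw.items()}

weights = tfidf_vector(norm_docs[0])
print({word: round(weight, 2) for word, weight in weights.items()})
# {'sky': 0.6, 'blue': 0.53, 'beautiful': 0.6}
```

'blue' receives a lower weight than 'sky' or 'beautiful' in document 0 because it appears in four documents rather than three, so its idf is smaller.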


## Extracting Features for New Documents

Suppose you have built a machine learning model to classify and categorize news articles and it is currently in production.

How will you generate features for completely new documents so that you can feed them into your machine learning model for predictions?

The scikit-learn API provides the `transform(…)` function for the vectorizers we discussed previously and we can leverage the same to get features for a completely new document which was not present in our corpus previously (when we trained our model in the past).

In [None]:
tv.get_feature_names_out()

array(['bacon', 'beans', 'beautiful', 'blue', 'breakfast', 'brown', 'dog',
       'eggs', 'fox', 'green', 'ham', 'jumps', 'kings', 'lazy', 'love',
       'quick', 'sausages', 'sky', 'toast', 'today'], dtype=object)

In [None]:
new_doc = 'the sky is green orange today'
new_doc

'the sky is green orange today'

In [None]:
tv.transform([new_doc]).toarray()

array([[0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.62956522,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.4552969 , 0.        , 0.62956522]])

In [None]:
# TF-IDF feature vector for the document - out of training vocabulary words like 'orange' are ignored
pd.DataFrame(np.round(tv.transform([new_doc]).toarray(), 2),
             columns=tv.get_feature_names_out())

Unnamed: 0,bacon,beans,beautiful,blue,breakfast,brown,dog,eggs,fox,green,ham,jumps,kings,lazy,love,quick,sausages,sky,toast,today
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.63,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.46,0.0,0.63


In [None]:
# BOW feature vector for the document - out of training vocabulary words like 'orange' are ignored
new_doc = 'the sky is green orange today'
pd.DataFrame(np.round(cv.transform([new_doc]).toarray(), 2),
             columns=cv.get_feature_names_out())

Unnamed: 0,bacon,beans,beautiful,blue,breakfast,brown,dog,eggs,fox,green,ham,jumps,kings,lazy,love,quick,sausages,sky,toast,today
0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,1


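```
from sklearn.preprocessing import StandardScaler

# df_train and df_test are placeholder DataFrames with 'sqft' and 'numrooms' columns
X_train = df_train[['sqft', 'numrooms']]
X_test = df_test[['sqft', 'numrooms']]

ss = StandardScaler()  # standardizes each column as (x - mean) / sd

# two-step version
ss.fit(X_train)                         # learn the mean and sd of sqft and numrooms from X_train
X_train_scaled = ss.transform(X_train)  # apply (x - mean) / sd to sqft and numrooms

# equivalent one-step version
X_train_scaled = ss.fit_transform(X_train)  # learn mean and sd, then scale the training data

# the test data is scaled with the mean and sd learned from the training data
X_test_scaled = ss.transform(X_test)
```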

Just like the standard scaler in ML, the same process applies to vectorized features in NLP:

 - `fit` or `fit_transform` - use only on the training data to learn the unique words, i.e. the vocabulary
 - `transform` - use on validation / test / any new text data to turn it into vectors using ONLY the words in the training vocabulary captured during `fit` or `fit_transform`. Any new words never seen in the training data are ignored and not considered as features
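The split between the two steps can be sketched with a toy vectorizer. `TinyCountVectorizer` below is a hypothetical, heavily simplified stand-in written for illustration only; it assumes whitespace tokenization and does not reproduce scikit-learn's actual implementation.

```python
class TinyCountVectorizer:
    """Hypothetical, simplified stand-in for a count vectorizer (illustration only)."""

    def fit(self, docs):
        # Learn the vocabulary from the training documents only
        self.vocabulary_ = sorted({word for doc in docs for word in doc.split()})
        return self

    def transform(self, docs):
        # Count only words already in the learned vocabulary; unseen words are skipped
        index = {word: i for i, word in enumerate(self.vocabulary_)}
        vectors = []
        for doc in docs:
            row = [0] * len(index)
            for word in doc.split():
                if word in index:
                    row[index[word]] += 1
            vectors.append(row)
        return vectors

train_docs = ['sky blue beautiful', 'love green eggs ham', 'sky blue sky beautiful today']
vectorizer = TinyCountVectorizer().fit(train_docs)

print(vectorizer.vocabulary_)
# ['beautiful', 'blue', 'eggs', 'green', 'ham', 'love', 'sky', 'today']
print(vectorizer.transform(['the sky is green orange today']))
# [[0, 0, 0, 1, 0, 0, 1, 1]]
```

The new document contributes counts only for 'green', 'sky' and 'today'; 'the', 'is' and 'orange' were never seen during `fit`, so they leave no trace in the vector and the vector length stays fixed at the size of the training vocabulary.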

## Document Similarity

Document similarity is the process of using a distance or similarity based metric that can be used to identify how similar a text document is with any other document(s) based on features extracted from the documents like bag of words or tf-idf.

Thus you can see that we can build on top of the tf-idf based features we engineered in the previous section and use them to generate new features which can be useful in domains like search engines, document clustering and information retrieval by leveraging these similarity based features.

Pairwise document similarity in a corpus involves computing document similarity for each pair of documents in a corpus. Thus if you have C documents in a corpus, you would end up with a C x C matrix in which the cell at row i and column j holds the similarity score between documents i and j. There are several similarity and distance metrics that are used to compute document similarity.

These include cosine distance/similarity, euclidean distance, manhattan distance, BM25 similarity, jaccard distance and so on. In our analysis, we will be using perhaps the most popular and widely used similarity metric, cosine similarity and compare pairwise document similarity based on their TF-IDF feature vectors.

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

In [None]:
similarity_matrix = cosine_similarity(tv_matrix)
similarity_df = pd.DataFrame(similarity_matrix)
similarity_df

Unnamed: 0,0,1,2,3,4,5,6,7
0,1.0,0.820599,0.0,0.0,0.0,0.192353,0.817246,0.0
1,0.820599,1.0,0.0,0.0,0.225489,0.157845,0.670631,0.0
2,0.0,0.0,1.0,0.0,0.0,0.791821,0.0,0.850516
3,0.0,0.0,0.0,1.0,0.506866,0.0,0.0,0.0
4,0.0,0.225489,0.0,0.506866,1.0,0.0,0.0,0.0
5,0.192353,0.157845,0.791821,0.0,0.0,1.0,0.115488,0.930989
6,0.817246,0.670631,0.0,0.0,0.0,0.115488,1.0,0.0
7,0.0,0.0,0.850516,0.0,0.0,0.930989,0.0,1.0


Cosine similarity gives us the cosine of the angle between the feature vectors of two text documents. The smaller the angle between the vectors, the closer and more similar the documents are, as depicted in the following figure.



Looking closely at the similarity matrix shows that documents 0, 1 and 6 are very similar to one another, as are documents 2, 5 and 7 (pairwise scores between roughly 0.67 and 0.93). Documents 3 and 4 are moderately similar to each other (about 0.51), which is weaker than the scores within the other two groups but still stronger than their similarity to any other document. This indicates that the documents in each group share features. It is a perfect example of grouping or clustering, which can be solved by unsupervised learning, especially when you are dealing with huge corpora of millions of text documents.
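To make the cosine computation itself concrete, here is a minimal sketch from the definition: the dot product of two vectors divided by the product of their Euclidean norms. It uses raw count vectors over a four-word vocabulary for brevity, so the value differs from the TF-IDF based score in the matrix above; `cosine_similarity_pair` is an illustrative helper name.

```python
import math

def cosine_similarity_pair(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Raw count vectors over the words ['beautiful', 'blue', 'love', 'sky']
doc0 = [1, 1, 0, 1]  # 'sky blue beautiful'
doc1 = [1, 1, 1, 1]  # 'love blue beautiful sky'

print(round(cosine_similarity_pair(doc0, doc1), 3))  # 0.866
print(cosine_similarity_pair([1, 0], [0, 1]))        # 0.0 (no shared words)
```

Because the measure depends only on the angle, a document and a copy of it repeated twice (all counts doubled) have a similarity of 1.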

For any document, you can use this matrix to find the most similar documents easily. Check the following example.

In [None]:
# sample document
norm_corpus[0]

'sky blue beautiful'

In [None]:
# similarity scores for all documents to document 0
similarity_matrix[0,:]

array([1.        , 0.82059862, 0.        , 0.        , 0.        ,
       0.19235302, 0.81724648, 0.        ])

In [None]:
# get the top 2 similar document IDs / row numbers (position 0 is the document itself, so skip it)
np.argsort(-similarity_matrix[0,:])[1:3]

array([1, 6])

In [None]:
# show the top 2 similar documents
norm_corpus[np.argsort(-similarity_matrix[0,:])[1:3]]

array(['love blue beautiful sky', 'sky blue sky beautiful today'],
      dtype='<U51')


## Bonus: Document Clustering on Similarity Features

We will use a very popular partition based clustering method, K-means clustering to cluster or group these documents based on their similarity based feature representations.

In K-means clustering, we have an input parameter k, which specifies the number of clusters it will output using the document features. This clustering method is a centroid based clustering method, where it tries to cluster these documents into clusters of equal variance. It tries to create these clusters by minimizing the within-cluster sum of squares measure, also known as inertia.
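The inertia that K-means tries to minimize can be computed by hand for any given cluster assignment. The sketch below uses four made-up 2D points (not the document features) and a hypothetical helper `inertia` to show that a sensible grouping yields a much smaller within-cluster sum of squares than a poor one.

```python
def inertia(points, labels):
    """Within-cluster sum of squared distances to each cluster's centroid."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    total = 0.0
    for members in clusters.values():
        # Centroid: per-dimension mean of the cluster's points
        centroid = [sum(values) / len(members) for values in zip(*members)]
        total += sum((x - c) ** 2
                     for point in members
                     for x, c in zip(point, centroid))
    return total

# Toy data: two points on the left, two on the right
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]

print(inertia(points, [0, 0, 1, 1]))  # 4.0   (left vs right)
print(inertia(points, [0, 1, 0, 1]))  # 100.0 (bottom vs top)
```

K-means searches for the assignment with low inertia by alternating between assigning points to the nearest centroid and recomputing the centroids.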

In [None]:
from sklearn.cluster import KMeans

In [None]:
km = KMeans(n_clusters=3, random_state=0)
km.fit(similarity_matrix)  # each row of the similarity matrix is one document's feature vector
cluster_labels = km.labels_
cluster_labels = pd.DataFrame(cluster_labels, columns=['ClusterLabel'])
pd.concat([corpus_df, cluster_labels], axis=1)



Unnamed: 0,Document,Category,ClusterLabel
0,The sky is blue and beautiful.,weather,2
1,Love this blue and beautiful sky!,weather,2
2,The quick brown fox jumps over the lazy dog.,animals,1
3,"A king's breakfast has sausages, ham, bacon, eggs, toast and beans",food,0
4,"I love green eggs, ham, sausages and bacon!",food,0
5,The brown fox is quick and the blue dog is lazy!,animals,1
6,The sky is very blue and the sky is very beautiful today,weather,2
7,The dog is lazy but the brown fox is quick!,animals,1


Thus you can clearly see that the algorithm has grouped the documents into three clusters that match the three categories exactly. The cluster numbers themselves (0, 1, 2) are arbitrary identifiers. This should give you a good idea of how our TF-IDF features were leveraged to build similarity features, which in turn helped in clustering our documents.

## Topic Model Features

While we will be covering topic modeling in detail in a separate module, a discussion about feature engineering is not complete without talking about topic models.

The idea of topic models revolves around the process of extracting key themes or concepts from a corpus of documents which are represented as topics.

Each topic can be represented as a bag or collection of words/terms from the document corpus. Together, these terms signify a specific topic, theme or a concept and each topic can be easily distinguished from other topics by virtue of the semantic meaning conveyed by these terms.


![](https://i.imgur.com/MeroIZm.png)


There are topic models like Latent Dirichlet Allocation (LDA), which uses a generative probabilistic model where each document consists of a combination of several topics and each term or word can be assigned to a specific topic.

For the purpose of feature engineering which is the intent of this tutorial, you need to remember that when LDA is applied on a document-term matrix (TF-IDF or Bag of Words feature matrix), it gets decomposed into two main components.


- A document-topic matrix, which would be the feature matrix we are looking for.
- A topic-term matrix, which helps us in looking at potential topics in the corpus.


Let’s now leverage scikit-learn to get the document-topic matrix as follows, which can be used as features for any subsequent modeling requirements.

In [None]:
from sklearn.decomposition import LatentDirichletAllocation

# We know here our corpus has three types of topics.
# Usually you might have to play around with this number (hyperparameter)
lda = LatentDirichletAllocation(n_components=3, max_iter=10000, random_state=0)
dt_matrix = lda.fit_transform(cv_matrix)
dt_matrix

array([[0.83219089, 0.08347968, 0.08432943],
       [0.8635544 , 0.06910008, 0.06734552],
       [0.04779351, 0.0477762 , 0.90443028],
       [0.0372428 , 0.92555906, 0.03719814],
       [0.04912129, 0.90307645, 0.04780226],
       [0.05490174, 0.04777755, 0.89732072],
       [0.88828704, 0.05569712, 0.05601584],
       [0.05570386, 0.05568915, 0.88860699]])

In [None]:
# Document - Topic Matrix -> shows how dominant each topic is per document
features = pd.DataFrame(dt_matrix, columns=['T1', 'T2', 'T3'])
features

Unnamed: 0,T1,T2,T3
0,0.832191,0.08348,0.084329
1,0.863554,0.0691,0.067346
2,0.047794,0.047776,0.90443
3,0.037243,0.925559,0.037198
4,0.049121,0.903076,0.047802
5,0.054902,0.047778,0.897321
6,0.888287,0.055697,0.056016
7,0.055704,0.055689,0.888607


In [None]:
pd.concat([corpus_df, features], axis=1)

Unnamed: 0,Document,Category,T1,T2,T3
0,The sky is blue and beautiful.,weather,0.832191,0.08348,0.084329
1,Love this blue and beautiful sky!,weather,0.863554,0.0691,0.067346
2,The quick brown fox jumps over the lazy dog.,animals,0.047794,0.047776,0.90443
3,"A king's breakfast has sausages, ham, bacon, eggs, toast and beans",food,0.037243,0.925559,0.037198
4,"I love green eggs, ham, sausages and bacon!",food,0.049121,0.903076,0.047802
5,The brown fox is quick and the blue dog is lazy!,animals,0.054902,0.047778,0.897321
6,The sky is very blue and the sky is very beautiful today,weather,0.888287,0.055697,0.056016
7,The dog is lazy but the brown fox is quick!,animals,0.055704,0.055689,0.888607
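One simple way to use these features is to pick the dominant topic per document. The sketch below does this on the document-topic weights from the output above, rounded to three decimals and hard-coded so the block runs on its own; the variable names are illustrative.

```python
# Document-topic weights copied (rounded) from the output above
dt_rows = [
    [0.832, 0.083, 0.084],
    [0.864, 0.069, 0.067],
    [0.048, 0.048, 0.904],
    [0.037, 0.926, 0.037],
    [0.049, 0.903, 0.048],
    [0.055, 0.048, 0.897],
    [0.888, 0.056, 0.056],
    [0.056, 0.056, 0.889],
]
topic_names = ['T1', 'T2', 'T3']

# Dominant topic = column with the largest weight in each row
dominant = [topic_names[max(range(len(row)), key=row.__getitem__)] for row in dt_rows]
print(dominant)
# ['T1', 'T1', 'T3', 'T2', 'T2', 'T3', 'T1', 'T3']
```

The dominant topics line up with the weather, food and animals categories of the corpus, although the topic numbering itself carries no meaning.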


You can clearly see which of the three topics dominates each document in the above output. You can view the topics and their main constituents as follows.

In [None]:
cv.get_feature_names_out()

array(['bacon', 'beans', 'beautiful', 'blue', 'breakfast', 'brown', 'dog',
       'eggs', 'fox', 'green', 'ham', 'jumps', 'kings', 'lazy', 'love',
       'quick', 'sausages', 'sky', 'toast', 'today'], dtype=object)

In [None]:
tt_matrix = lda.components_
tt_matrix

array([[0.33369862, 0.33364718, 3.33236505, 3.37377425, 0.33364718,
        0.33389123, 0.33389123, 0.33369862, 0.33389123, 0.33379313,
        0.33369862, 0.33381403, 0.33364718, 0.33389123, 1.33041582,
        0.33389123, 0.33369862, 4.33243944, 0.33364718, 1.33255799],
       [2.33269587, 1.33277352, 0.33385287, 0.33428343, 1.33277352,
        0.33376141, 0.33376141, 2.33269587, 0.33376141, 1.33254315,
        2.33269587, 0.33376659, 1.33277352, 0.33376141, 1.33546105,
        0.33376141, 2.33269587, 0.3338117 , 1.33277352, 0.33374387],
       [0.33360551, 0.3335793 , 0.33378208, 1.29194231, 0.3335793 ,
        3.33234735, 3.33234735, 0.33360551, 3.33234735, 0.33366372,
        0.33360551, 1.33241938, 0.3335793 , 3.33234735, 0.33412313,
        3.33234735, 0.33360551, 0.33374886, 0.3335793 , 0.33369814]])

In [None]:
# Topic - Term Matrix -> shows the most important terms (words) per topic

vocab = cv.get_feature_names_out()  # LDA was fit on cv_matrix, so use the CountVectorizer vocabulary
tt_matrix = lda.components_
for topic_weights in tt_matrix:
    topic = [(token, round(weight, 3)) for token, weight in zip(vocab, topic_weights)]
    topic = sorted(topic, key=lambda x: -x[1])
    topic = [item for item in topic if item[1] > 0.6]  # keep only terms with weight above 0.6
    print(topic)
    print()

[('sky', 4.332), ('blue', 3.374), ('beautiful', 3.332), ('today', 1.333), ('love', 1.33)]

[('bacon', 2.333), ('eggs', 2.333), ('ham', 2.333), ('sausages', 2.333), ('love', 1.335), ('beans', 1.333), ('breakfast', 1.333), ('green', 1.333), ('kings', 1.333), ('toast', 1.333)]

[('brown', 3.332), ('dog', 3.332), ('fox', 3.332), ('lazy', 3.332), ('quick', 3.332), ('jumps', 1.332), ('blue', 1.292)]



Thus you can clearly see the three topics are quite distinguishable from each other based on their constituent terms, first one talking about weather, second one about food and the last one about animals. Choosing the number of topics for topic modeling is an entire topic on its own (pun not intended!) and is an art as well as a science.