<center><a target="_blank" href="https://www.womenplusplus.ch/deploy-impact/"><img src="https://drive.google.com/uc?id=1eOWdijaay1bv94PlIRMqNt7eWmUfAI2u" width="500" style="background:none; border:none; box-shadow:none;" /></a> </center>

<center><a target="_blank" href="https://learning.sit.org/"><img src="https://drive.google.com/uc?id=1x9_jQgLhozCSWDSaOdVxKmxOEAe_OLgV" width="250" style="background:none; border:none; box-shadow:none;" /></a> </center>

_____

<center> <h1> Live Coding - Vectorization, Similarity and Deduplication </h1> </center>

<p style="margin-bottom:1cm;"></p>

_____

<center>SIT Learning, 2022</center>


# Vectorized Text Representations

## Exploring Traditional Statistical Models

Feature engineering is often described as the secret sauce behind better-performing machine learning models.

A single well-chosen feature could be your ticket to winning a Kaggle challenge or to getting your model deployed in the enterprise!

Feature engineering matters even more for unstructured, textual data, because free-flowing text has to be converted into numeric representations that machine learning algorithms can work with.

Here we will explore:
+ TF-IDF Model
+ Document Similarity 




## Prepare Corpus

Let’s now take a sample corpus of documents on which we will run most of our analyses. A corpus is a collection of text documents, usually belonging to one or more subjects or domains.

In [57]:
import numpy as np
import pandas as pd

pd.options.display.max_colwidth = 200

In [56]:
corpus = ['The sky is blue and beautiful.',
          'Love this blue and beautiful sky!',
          'The quick brown fox jumps over the lazy dog.',
          "A king's breakfast has sausages, ham, bacon, eggs, toast and beans",
          'I love green eggs, ham, sausages and bacon!',
          'The brown fox is quick and the blue dog is lazy!',
          'The sky is very blue and the sky is very beautiful today',
          'The dog is lazy but the brown fox is quick!'    
]
labels = ['weather', 'weather', 'animals', 'food', 'food', 'animals', 'weather', 'animals']

In [25]:
corpus = np.array(corpus)
corpus_df = pd.DataFrame({'Document': corpus, 
                          'Category': labels})
corpus_df = corpus_df[['Document', 'Category']]
corpus_df

Unnamed: 0,Document,Category
0,The sky is blue and beautiful.,weather
1,Love this blue and beautiful sky!,weather
2,The quick brown fox jumps over the lazy dog.,animals
3,"A king's breakfast has sausages, ham, bacon, eggs, toast and beans",food
4,"I love green eggs, ham, sausages and bacon!",food
5,The brown fox is quick and the blue dog is lazy!,animals
6,The sky is very blue and the sky is very beautiful today,weather
7,The dog is lazy but the brown fox is quick!,animals


## Simple Preprocessing

Since the focus of this unit is on feature engineering, we will build a simple text pre-processor that removes special characters, extra whitespace, digits and stopwords, and lower-cases the text corpus.

In [58]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [59]:
import nltk
import re

In [28]:
stop_words = nltk.corpus.stopwords.words('english')

def normalize_document(doc):
    # remove everything except ASCII letters and whitespace (digits, punctuation, symbols etc.)
    doc = re.sub(r'[^a-zA-Z\s]', '', doc, flags=re.I|re.A)
    # lower case and trim leading/trailing whitespace
    doc = doc.lower()
    doc = doc.strip()
    # tokenize document
    tokens = nltk.word_tokenize(doc)
    # filter stopwords out of document
    filtered_tokens = [token for token in tokens if token not in stop_words]
    # re-create document from filtered tokens
    doc = ' '.join(filtered_tokens)
    return doc

In [29]:
normalize_corpus = np.vectorize(normalize_document) # this is same as running the function for all documents in a for loop

norm_corpus = normalize_corpus(corpus)
norm_corpus

array(['sky blue beautiful', 'love blue beautiful sky',
       'quick brown fox jumps lazy dog',
       'kings breakfast sausages ham bacon eggs toast beans',
       'love green eggs ham sausages bacon',
       'brown fox quick blue dog lazy', 'sky blue sky beautiful today',
       'dog lazy brown fox quick'], dtype='<U51')
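
To make the role of `np.vectorize` concrete, here is a minimal sketch showing that it gives the same result as a plain loop over the documents. It is only an illustration: `toy_normalize` and `TOY_STOP_WORDS` are hypothetical stand-ins for `normalize_document` and the NLTK stopword list, chosen so the sketch runs without downloading any NLTK data.

```python
import re
import numpy as np

# Hypothetical, hand-picked stopword set (an assumption, not the NLTK list).
TOY_STOP_WORDS = {'the', 'is', 'and', 'this'}

def toy_normalize(doc):
    # keep ASCII letters and whitespace only, then lower-case and trim
    doc = re.sub(r'[^a-zA-Z\s]', '', doc).lower().strip()
    # plain whitespace tokenization instead of nltk.word_tokenize
    tokens = [tok for tok in doc.split() if tok not in TOY_STOP_WORDS]
    return ' '.join(tokens)

docs = np.array(['The sky is blue and beautiful.',
                 'Love this blue and beautiful sky!'])

# np.vectorize applies the function element by element ...
vectorized = np.vectorize(toy_normalize)(docs)
# ... which is equivalent to an explicit loop over the documents
looped = np.array([toy_normalize(doc) for doc in docs])

print(vectorized)
print(np.array_equal(vectorized, looped))
```

`np.vectorize` is a convenience rather than a performance optimization, so for large corpora a list comprehension is just as good.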

## TF-IDF Model

The Bag of Words model can run into problems when it is used on large corpora. Because its feature vectors are based on absolute term frequencies, terms that occur frequently across all documents tend to overshadow the other terms in the feature set.

The TF-IDF model tries to combat this issue by using a scaling or normalizing factor in its computation. TF-IDF stands for Term Frequency-Inverse Document Frequency, and it combines two metrics: term frequency (tf) and inverse document frequency (idf).

This technique was originally developed for ranking query results in search engines and is now an indispensable model in information retrieval and NLP.

Mathematically, we can define TF-IDF as $$tfidf = tf \times idf$$

which can be expanded further as follows:

$$tfidf(w, D) = tf(w, D) \times idf(w, D) = tf(w, D) \times \log\left(\frac{C}{df(w)}\right)$$

Here, $tfidf(w, D)$ is the TF-IDF score for word $w$ in document $D$.

- The term $tf(w, D)$ represents the term frequency of the word $w$ in document $D$, which can be obtained from the Bag of Words model.
- The term $idf(w, D)$ is the inverse document frequency of the word $w$. It is computed as the log of the total number of documents in the corpus, $C$, divided by the document frequency of $w$, that is, the number of documents in the corpus in which $w$ occurs (not the total number of times $w$ occurs across all documents).

In most implementations, each document's tfidf vector is then normalized by dividing it by its L2 norm, also known as the Euclidean norm (the square root of the sum of the squares of the vector's tfidf weights).
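
The computation described above can be sketched by hand in NumPy. This is a sketch under one stated assumption: it uses the smoothed idf variant that scikit-learn documents for its default settings, $\ln\left(\frac{1 + C}{1 + df(w)}\right) + 1$, rather than the plain $\log(C / df(w))$, so that words occurring in every document do not get a weight of zero. The documents are copied from the `norm_corpus` output above.

```python
import numpy as np

# Normalized documents copied from the norm_corpus output above.
norm_docs = ['sky blue beautiful', 'love blue beautiful sky',
             'quick brown fox jumps lazy dog',
             'kings breakfast sausages ham bacon eggs toast beans',
             'love green eggs ham sausages bacon',
             'brown fox quick blue dog lazy', 'sky blue sky beautiful today',
             'dog lazy brown fox quick']

vocab = sorted({word for doc in norm_docs for word in doc.split()})

# tf(w, D): raw count of each vocabulary word in each document
tf = np.array([[doc.split().count(word) for word in vocab]
               for doc in norm_docs], dtype=float)

n_docs = len(norm_docs)            # C, the number of documents
df = (tf > 0).sum(axis=0)          # df(w), documents containing each word

# Smoothed idf (assumption: the variant scikit-learn uses by default)
idf = np.log((1 + n_docs) / (1 + df)) + 1

# tf-idf, then L2-normalize each document (row) vector
tfidf = tf * idf
tfidf = tfidf / np.linalg.norm(tfidf, axis=1, keepdims=True)

# Non-zero weights of the first document, rounded to two decimals
first_doc = {w: round(float(v), 2) for w, v in zip(vocab, tfidf[0]) if v > 0}
print(first_doc)
```

For the first document, `sky` and `beautiful` end up with a higher weight than `blue`, because `blue` occurs in more documents and therefore has a lower idf.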

There are multiple variants of this model but they all end up giving quite similar results. Let’s apply this on our corpus now!

### Using TF-IDF Vectorizer

You don't always need to generate count-based Bag of Words features before engineering TF-IDF features. The TfidfVectorizer from scikit-learn computes the tfidf vectors directly from the raw documents, internally computing both the term frequencies and the inverse document frequencies.

In [60]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [34]:
tv = TfidfVectorizer(min_df=0., max_df=1., norm='l2', use_idf=True)
tv_matrix = tv.fit_transform(norm_corpus)
tv_matrix = tv_matrix.toarray()

vocab = tv.get_feature_names_out()
pd.DataFrame(np.round(tv_matrix, 2), columns=vocab)

Unnamed: 0,bacon,beans,beautiful,blue,breakfast,brown,dog,eggs,fox,green,ham,jumps,kings,lazy,love,quick,sausages,sky,toast,today
0,0.0,0.0,0.6,0.53,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.6,0.0,0.0
1,0.0,0.0,0.49,0.43,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.57,0.0,0.0,0.49,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.38,0.38,0.0,0.38,0.0,0.0,0.53,0.0,0.38,0.0,0.38,0.0,0.0,0.0,0.0
3,0.32,0.38,0.0,0.0,0.38,0.0,0.0,0.32,0.0,0.0,0.32,0.0,0.38,0.0,0.0,0.0,0.32,0.0,0.38,0.0
4,0.39,0.0,0.0,0.0,0.0,0.0,0.0,0.39,0.0,0.47,0.39,0.0,0.0,0.0,0.39,0.0,0.39,0.0,0.0,0.0
5,0.0,0.0,0.0,0.37,0.0,0.42,0.42,0.0,0.42,0.0,0.0,0.0,0.0,0.42,0.0,0.42,0.0,0.0,0.0,0.0
6,0.0,0.0,0.36,0.32,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.72,0.0,0.5
7,0.0,0.0,0.0,0.0,0.0,0.45,0.45,0.0,0.45,0.0,0.0,0.0,0.0,0.45,0.0,0.45,0.0,0.0,0.0,0.0


## Extracting Features for New Documents

Suppose you have built a machine learning model to classify and categorize news articles and it is currently in production.

How will you generate features for completely new documents so that you can feed them into your machine learning models for predictions?

The scikit-learn API provides the `transform(…)` function for the vectorizer we used above, and we can use it to get features for a completely new document that was not present in our corpus when the vectorizer was fitted.

In [35]:
tv.get_feature_names_out()

array(['bacon', 'beans', 'beautiful', 'blue', 'breakfast', 'brown', 'dog',
       'eggs', 'fox', 'green', 'ham', 'jumps', 'kings', 'lazy', 'love',
       'quick', 'sausages', 'sky', 'toast', 'today'], dtype=object)

In [39]:
new_doc = 'the sky is green orange today'
new_doc

'the sky is green orange today'

In [43]:
tv.transform([new_doc]).toarray()

array([[0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.62956522,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.4552969 , 0.        , 0.62956522]])

In [44]:
pd.DataFrame(np.round(tv.transform([new_doc]).toarray(), 2), 
             columns=tv.get_feature_names_out())

Unnamed: 0,bacon,beans,beautiful,blue,breakfast,brown,dog,eggs,fox,green,ham,jumps,kings,lazy,love,quick,sausages,sky,toast,today
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.63,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.46,0.0,0.63


## Document Similarity

Document similarity is the process of using a distance or similarity metric to measure how similar a text document is to other documents, based on features extracted from them such as bag of words or tf-idf.

We can therefore build on the tf-idf features we engineered in the previous section and use them to generate similarity-based features, which are useful in domains like search engines, document clustering and information retrieval.

Pairwise document similarity involves computing the similarity for each pair of documents in a corpus. If you have C documents in a corpus, you end up with a C x C matrix in which the entry at row i and column j is the similarity score between documents i and j. There are several similarity and distance metrics that can be used to compute document similarity.

These include cosine distance/similarity, Euclidean distance, Manhattan distance, BM25 similarity, Jaccard distance and so on. In our analysis, we will use cosine similarity, perhaps the most popular and widely used of these, and compare documents pairwise based on their TF-IDF feature vectors.
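
Cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below computes it by hand on small made-up vectors (`a`, `b` and `c` are illustrative values, not taken from the corpus) and compares the result with scikit-learn's `cosine_similarity`.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def cosine(u, v):
    # dot product divided by the product of the Euclidean lengths
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 4.0, 0.0])   # same direction as a, twice the length
c = np.array([0.0, 0.0, 3.0])   # shares no non-zero dimension with a

sim_ab = cosine(a, b)           # close to 1: same direction
sim_ac = cosine(a, c)           # 0: orthogonal vectors

# scikit-learn expects 2-D inputs (one row per vector)
sk_ab = float(cosine_similarity([a], [b])[0, 0])

# For vectors already scaled to unit length, cosine is just the dot product
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)
dot_ab = float(np.dot(a_unit, b_unit))

print(sim_ab, sim_ac, sk_ab, dot_ab)
```

Because cosine similarity ignores vector length, a long document and a short one that use the same words in the same proportions are still rated as very similar.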

In [46]:
from sklearn.metrics.pairwise import cosine_similarity

In [47]:
similarity_matrix = cosine_similarity(tv_matrix)
similarity_df = pd.DataFrame(similarity_matrix)
similarity_df

Unnamed: 0,0,1,2,3,4,5,6,7
0,1.0,0.820599,0.0,0.0,0.0,0.192353,0.817246,0.0
1,0.820599,1.0,0.0,0.0,0.225489,0.157845,0.670631,0.0
2,0.0,0.0,1.0,0.0,0.0,0.791821,0.0,0.850516
3,0.0,0.0,0.0,1.0,0.506866,0.0,0.0,0.0
4,0.0,0.225489,0.0,0.506866,1.0,0.0,0.0,0.0
5,0.192353,0.157845,0.791821,0.0,0.0,1.0,0.115488,0.930989
6,0.817246,0.670631,0.0,0.0,0.0,0.115488,1.0,0.0
7,0.0,0.0,0.850516,0.0,0.0,0.930989,0.0,1.0


In [61]:
norm_corpus[0]

'sky blue beautiful'

In [49]:
similarity_matrix[0,:]

array([1.        , 0.82059862, 0.        , 0.        , 0.        ,
       0.19235302, 0.81724648, 0.        ])

In [50]:
np.argsort(-similarity_matrix[0,:])[1:3]

array([1, 6])

In [51]:
norm_corpus[np.argsort(-similarity_matrix[0,:])[1:3]]

array(['love blue beautiful sky', 'sky blue sky beautiful today'],
      dtype='<U51')

In [52]:
norm_corpus

array(['sky blue beautiful', 'love blue beautiful sky',
       'quick brown fox jumps lazy dog',
       'kings breakfast sausages ham bacon eggs toast beans',
       'love green eggs ham sausages bacon',
       'brown fox quick blue dog lazy', 'sky blue sky beautiful today',
       'dog lazy brown fox quick'], dtype='<U51')

Cosine similarity gives us the cosine of the angle between the feature vectors of two text documents. The smaller the angle between the two vectors, the closer and more similar the documents are, as depicted in the following figure.



Looking closely at the similarity matrix, we can see that documents (0, 1 and 6) and (2, 5 and 7) are very similar to one another. Documents 3 and 4 are only moderately similar to each other (about 0.51), but still more similar to each other than to any other document. This indicates that the documents in each group share some features. It is a good example of a grouping or clustering problem that can be solved by unsupervised learning, especially when you are dealing with huge corpora of millions of text documents.
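
One way to read such groups off the matrix programmatically is to list every pair of documents whose similarity exceeds a cut-off. The sketch below does this on the similarity matrix shown above, rounded to three decimals; the `threshold` of 0.8 is a hypothetical value chosen for illustration, not something the notebook prescribes.

```python
import numpy as np

# Similarity matrix copied from the output above, rounded to three decimals.
similarity = np.array([
    [1.000, 0.821, 0.000, 0.000, 0.000, 0.192, 0.817, 0.000],
    [0.821, 1.000, 0.000, 0.000, 0.225, 0.158, 0.671, 0.000],
    [0.000, 0.000, 1.000, 0.000, 0.000, 0.792, 0.000, 0.851],
    [0.000, 0.000, 0.000, 1.000, 0.507, 0.000, 0.000, 0.000],
    [0.000, 0.225, 0.000, 0.507, 1.000, 0.000, 0.000, 0.000],
    [0.192, 0.158, 0.792, 0.000, 0.000, 1.000, 0.115, 0.931],
    [0.817, 0.671, 0.000, 0.000, 0.000, 0.115, 1.000, 0.000],
    [0.000, 0.000, 0.851, 0.000, 0.000, 0.931, 0.000, 1.000],
])

threshold = 0.8  # hypothetical cut-off, tune it for your own data

# Only look above the diagonal: the matrix is symmetric and the
# diagonal holds each document's similarity with itself.
rows, cols = np.triu_indices_from(similarity, k=1)
pairs = [(int(i), int(j), float(similarity[i, j]))
         for i, j in zip(rows, cols)
         if similarity[i, j] >= threshold]

for i, j, score in pairs:
    print(f'documents {i} and {j}: {score:.3f}')
```

With this cut-off, documents 2 and 5 (about 0.79) are not listed as a pair even though both are close to document 7, which shows how sensitive the result is to the chosen threshold.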

## Bonus: Document Clustering on Similarity Features

We will use K-means, a very popular partition-based clustering method, to group these documents based on their similarity-based feature representations.

In K-means clustering, the input parameter k specifies the number of clusters to produce from the document features. It is a centroid-based method that tries to separate the documents into clusters of equal variance by minimizing the within-cluster sum of squares, also known as inertia.
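
To make the inertia criterion concrete, the sketch below fits K-means on a handful of made-up 2-D points (`points` is illustrative data, not the document features) and recomputes the within-cluster sum of squares by hand, which should agree with the `inertia_` attribute of the fitted model.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated groups of made-up points (illustrative data only).
points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [8.0, 8.0], [8.0, 9.0], [9.0, 8.0]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

# Centroid assigned to each point, looked up through its cluster label.
assigned_centers = km.cluster_centers_[km.labels_]

# Inertia: sum of squared distances from each point to its own centroid.
manual_inertia = float(((points - assigned_centers) ** 2).sum())

print(km.labels_)
print(manual_inertia, km.inertia_)
```

Inertia always decreases as k grows, so it cannot be used on its own to pick the number of clusters.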

In [53]:
from sklearn.cluster import KMeans

In [54]:
km = KMeans(n_clusters=3, random_state=0)
km.fit(similarity_matrix)  # only the fitted labels are needed, not the transformed distances
cluster_labels = km.labels_
cluster_labels = pd.DataFrame(cluster_labels, columns=['ClusterLabel'])
pd.concat([corpus_df, cluster_labels], axis=1)

Unnamed: 0,Document,Category,ClusterLabel
0,The sky is blue and beautiful.,weather,2
1,Love this blue and beautiful sky!,weather,2
2,The quick brown fox jumps over the lazy dog.,animals,1
3,"A king's breakfast has sausages, ham, bacon, eggs, toast and beans",food,0
4,"I love green eggs, ham, sausages and bacon!",food,0
5,The brown fox is quick and the blue dog is lazy!,animals,1
6,The sky is very blue and the sky is very beautiful today,weather,2
7,The dog is lazy but the brown fox is quick!,animals,1


You can see that the algorithm has recovered the three categories in our documents: the cluster labels assigned to them line up with the weather, animals and food groups. This should give you a good idea of how our TF-IDF features were used to build similarity features, which in turn helped in clustering our documents.