In [23]:
from contractions import CONTRACTION_MAP
import re
import nltk
import string
from nltk.stem import WordNetLemmatizer

In [24]:
CORPUS = [
'the sky is blue',
'sky is blue and sky is beautiful',
'the beautiful sky is so blue',
'i love blue cheese'
]

new_doc = ['loving this blue sky today']

# Bag of Words Model

> Convert text documents into vectors that represent the frequency of each distinct word, disregarding word order.

NOTE:
- We can generalize this to an n-gram Bag of Words model, in which the vector represents the frequency of each distinct n-gram.

Understanding N-grams
- Unigram: A single word. For example, in the sentence "the cat sat on the mat," the unigrams are ["the", "cat", "sat", "on", "the", "mat"].
- Bigram: A sequence of two consecutive words. For example, in the same sentence, the bigrams are ["the cat", "cat sat", "sat on", "on the", "the mat"].
- Trigram: A sequence of three consecutive words. For example, the trigrams are ["the cat sat", "cat sat on", "sat on the", "on the mat"].
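The n-gram definitions above can be sketched in a few lines of plain Python. The helper name `ngrams` is hypothetical (it is not part of the notebook), and splitting on whitespace is a simplifying assumption; a real tokenizer would also handle punctuation and casing.

```python
def ngrams(text, n):
    """Return the n-grams of `text` as space-joined strings."""
    tokens = text.split()  # naive whitespace tokenization (assumption)
    # Slide a window of size n over the token list
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "the cat sat on the mat"
unigrams = ngrams(sentence, 1)
bigrams = ngrams(sentence, 2)
trigrams = ngrams(sentence, 3)

print(unigrams)
print(bigrams)
print(trigrams)
```

A sentence with k tokens yields k - n + 1 n-grams, which is why the six-word sentence gives six unigrams, five bigrams and four trigrams.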

In [25]:
from sklearn.feature_extraction.text import CountVectorizer

def bow_extractor(corpus, ngram_range=(1,1)):
    vectorizer = CountVectorizer(min_df=1, ngram_range=ngram_range)
    features = vectorizer.fit_transform(corpus)
    return vectorizer, features

# build bow vectorizer and get features
bow_vectorizer, bow_features = bow_extractor(CORPUS)
features = bow_features.todense()
print(features)

[[0 0 1 0 1 0 1 0 1]
 [1 1 1 0 2 0 2 0 0]
 [0 1 1 0 1 0 1 1 1]
 [0 0 1 1 0 1 0 0 0]]
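To make the matrix above less of a black box, here is a rough by-hand sketch of the same counting using only the standard library. The names `tokenize`, `bow_vector` and `manual_bow` are hypothetical, and the regular expression is an approximation of `CountVectorizer`'s default tokenization (lowercase, tokens of two or more word characters), which is why the single-letter word 'i' never becomes a feature.

```python
import re
from collections import Counter

CORPUS = [
    'the sky is blue',
    'sky is blue and sky is beautiful',
    'the beautiful sky is so blue',
    'i love blue cheese'
]

def tokenize(doc):
    # Approximates CountVectorizer's default: lowercase, keep tokens of 2+ word characters
    return re.findall(r"\b\w\w+\b", doc.lower())

# Vocabulary: sorted distinct tokens, so the columns come out in alphabetical order
vocab = sorted({tok for doc in CORPUS for tok in tokenize(doc)})

def bow_vector(doc, vocab):
    counts = Counter(tokenize(doc))
    # A Counter returns 0 for terms that do not occur in the document
    return [counts[term] for term in vocab]

manual_bow = [bow_vector(doc, vocab) for doc in CORPUS]
for row in manual_bow:
    print(row)
```

Calling `bow_vector('loving this blue sky today', vocab)` keeps only 'blue' and 'sky', because words that were not seen when the vocabulary was built have no column to land in.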


In [26]:
# extract features from new document using built vectorizer
import numpy as np  # np is used in this cell, so import it here

new_doc_features = np.array(bow_vectorizer.transform(new_doc).todense())
print(new_doc_features)

[[0 0 1 0 0 0 1 0 0]]


In [27]:
# print feature names

feature_names = bow_vectorizer.get_feature_names_out()
print(feature_names)

['and' 'beautiful' 'blue' 'cheese' 'is' 'love' 'sky' 'so' 'the']


In [28]:
# Display features as dataframe
import pandas as pd

def display_features(features, feature_names):
    df = pd.DataFrame(data=features,
                      columns=feature_names)
    
    return df
    
display_features(features, feature_names)

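| | and | beautiful | blue | cheese | is | love | sky | so | the |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 |
| 1 | 1 | 1 | 1 | 0 | 2 | 0 | 2 | 0 | 0 |
| 2 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 1 |
| 3 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 |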


# TF-IDF

There are two components:
- Term Frequency (tf) => Frequency of a "term" in a particular document.

NOTE: Term frequency is exactly what the Bag of Words model computes. It can be used as a raw count or in normalized form.

- Inverse Document Frequency (idf) => Logarithm of the total number of documents in the corpus divided by the document frequency (df), i.e. the number of documents that contain the term.

NOTE: Common variations: (1) add 1 to the document frequency to prevent division by zero; (2) add 1 to the idf so that a term appearing in every document does not get a zero weight; (3) normalize the resulting vector by dividing it by its Euclidean (L2) norm.

- TF-IDF = tf x idf
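Written out, with $N$ the number of documents in the corpus and $\mathrm{df}(t)$ the number of documents that contain term $t$, the components above look roughly as follows. The smoothed form is, to the best of my understanding, the variant scikit-learn applies when `smooth_idf=True`; other libraries and textbooks may use a different log base or smoothing.

$$
\mathrm{idf}(t) = \log\frac{N}{\mathrm{df}(t)}
\qquad
\mathrm{idf}_{\text{smooth}}(t) = \ln\frac{1 + N}{1 + \mathrm{df}(t)} + 1
$$

$$
\text{tf-idf}(t, d) = \mathrm{tf}(t, d) \times \mathrm{idf}(t)
$$

For example, in a corpus of $N = 4$ documents, a term that appears in all four gets $\mathrm{idf}_{\text{smooth}} = \ln(5/5) + 1 = 1$, while a term that appears in only one gets $\ln(5/2) + 1 \approx 1.92$.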

In [29]:
from sklearn.feature_extraction.text import TfidfTransformer
import numpy as np

def tfidf_transformer(bow_matrix):
    transformer = TfidfTransformer(norm='l2',
                                   smooth_idf=True,
                                   use_idf=True)
    
    tfidf_matrix = transformer.fit_transform(bow_matrix)
    return transformer, tfidf_matrix

# build tfidf transformer and show train corpus tfidf features

tfidf_trans, tfidf_features = tfidf_transformer(bow_features)
features = np.round(tfidf_features.todense(), 2)
feature_names = bow_vectorizer.get_feature_names_out()
display_features(features, feature_names)

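| | and | beautiful | blue | cheese | is | love | sky | so | the |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.4 | 0.0 | 0.49 | 0.0 | 0.49 | 0.0 | 0.6 |
| 1 | 0.44 | 0.35 | 0.23 | 0.0 | 0.56 | 0.0 | 0.56 | 0.0 | 0.0 |
| 2 | 0.0 | 0.43 | 0.29 | 0.0 | 0.35 | 0.0 | 0.35 | 0.55 | 0.43 |
| 3 | 0.0 | 0.0 | 0.35 | 0.66 | 0.0 | 0.66 | 0.0 | 0.0 | 0.0 |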
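The table above can be reproduced by hand with NumPy, which helps show what the transformer settings do. This is a minimal sketch, assuming the smoothed idf `ln((1 + N) / (1 + df)) + 1` followed by L2 normalization of each row; the variable names (`counts`, `manual_tfidf`, and so on) are hypothetical, and the count matrix is copied from the Bag of Words output.

```python
import numpy as np

# Count matrix from the Bag of Words section (rows: documents, columns: terms)
counts = np.array([
    [0, 0, 1, 0, 1, 0, 1, 0, 1],
    [1, 1, 1, 0, 2, 0, 2, 0, 0],
    [0, 1, 1, 0, 1, 0, 1, 1, 1],
    [0, 0, 1, 1, 0, 1, 0, 0, 0],
], dtype=float)

n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)               # number of documents containing each term
idf = np.log((1 + n_docs) / (1 + df)) + 1   # smoothed idf, with 1 added to avoid zero weights

tfidf = counts * idf                        # raw tf x idf
norms = np.linalg.norm(tfidf, axis=1, keepdims=True)
manual_tfidf = tfidf / norms                # L2-normalize each document row

print(np.round(manual_tfidf, 2))
```

Note how 'blue', which occurs in every document, ends up with the lowest idf (exactly 1), so it contributes less to each row than rarer terms such as 'cheese' or 'so'.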


In [30]:
# show tfidf features for new_doc using built tfidf transformer

nd_tfidf = tfidf_trans.transform(new_doc_features)
nd_features = np.round(nd_tfidf.todense(), 2)
display_features(nd_features, feature_names)

| | and | beautiful | blue | cheese | is | love | sky | so | the |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.63 | 0.0 | 0.0 | 0.0 | 0.77 | 0.0 | 0.0 |


# Advanced Word Vectorization Models

Some more advanced word vectorization models and tokenizers:
- TF-IDF Weighted Averaged Word Vectors
- SpaCy Tokenizer
- BERT Tokenizer
- Byte Pair Encoding (BPE) Tokenizer
- SentencePiece Tokenizer
- GPT-3 Tokenizer
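As a rough illustration of the first item, TF-IDF weighted averaged word vectors, the sketch below averages per-word vectors using TF-IDF scores as weights. Everything here is an assumption made for illustration: the 3-dimensional `word_vectors` are made up (real ones would come from a trained embedding model), the function name `tfidf_weighted_average` is hypothetical, and the weights are the rounded `new_doc` scores shown earlier.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors, made up for illustration only
word_vectors = {
    'blue': np.array([1.0, 0.0, 0.0]),
    'sky':  np.array([0.0, 1.0, 0.0]),
}

# TF-IDF weights for one document (rounded values from the new_doc table)
tfidf_weights = {'blue': 0.63, 'sky': 0.77}

def tfidf_weighted_average(tokens, word_vectors, tfidf_weights):
    """Average the vectors of known tokens, weighting each by its TF-IDF score."""
    dim = len(next(iter(word_vectors.values())))
    total = np.zeros(dim)
    weight_sum = 0.0
    for tok in tokens:
        # Skip tokens that have no vector or no TF-IDF weight
        if tok in word_vectors and tok in tfidf_weights:
            total += tfidf_weights[tok] * word_vectors[tok]
            weight_sum += tfidf_weights[tok]
    # Fall back to the zero vector when no token is known
    return total / weight_sum if weight_sum > 0 else total

doc_vector = tfidf_weighted_average('loving this blue sky today'.split(),
                                    word_vectors, tfidf_weights)
print(doc_vector)
```

The result is a single fixed-length vector per document, in which words with higher TF-IDF scores pull the average further toward their own vectors.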