# **Working with Text Data - Text Transformation (AKA Vectorization)**
**AKA Feature Extraction | Engineering | Transformation (Text to Numerical Vector)**

### **Text Data**

Text analysis is a major application field for machine learning algorithms. Some of the major application areas of NLP include:
1. Spell Checker, Keyword Search, etc.
2. Sentiment Analysis, Spam Classification
3. Machine Translation
4. Chatbots/Dialog Systems
5. Question Answering Systems

However, raw text is a sequence of symbols and cannot be fed directly to most algorithms, which expect fixed-size numerical feature vectors rather than variable-length text documents.

### **Why is NLP hard?**
1. Complexity of representation
> Poems, Sarcasm, etc...  
> Example 1: This task is a piece of cake.  
> Example 2: You have a football game tomorrow. Break a leg!

2. Ambiguity in Natural Language
> Ambiguity means uncertainty of meaning.  
> For example: The car hit the pole while it was moving. (Grammatically, "it" could refer to either the car or the pole.)

### **Vectorization Techniques (AKA Feature Engineering or Extraction - Convert Text to Numerical Vectors)**

1. Bag of Words
2. TF IDF (Term Frequency - Inverse Document Frequency)
3. Word2Vec (by Google)
4. GloVe (Global Vectors by Stanford)
5. FastText (by Facebook)
6. ELMo (Embeddings from Language Models)
7. GPT (Generative Pre-trained Transformer by OpenAI)
8. BERT (Bidirectional Encoder Representations from Transformer by Google)
9. LLMs (Large Language Models)

**Only the following techniques are covered in this notebook:**
1. Bag of Words
2. TF IDF (Term Frequency - Inverse Document Frequency)

In [1]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
lst_text = ['it Was the best oF Times $', 
            'It was The worst of times.',
            'IT 9 was tHe age Of wisdom', 
            'it was thE age of foolishness']

df = pd.DataFrame({'text': lst_text})

df.head()

Unnamed: 0,text
0,it Was the best oF Times $
1,It was The worst of times.
2,IT 9 was tHe age Of wisdom
3,it was thE age of foolishness


## **Bag of Words Representation**

We call vectorization the general process of turning a collection of text documents into numerical feature vectors. This specific strategy (tokenization, counting and normalization) is called the Bag of Words or "Bag of n-grams" representation. Documents are described by word occurrences while completely ignoring the relative position information of the words in the document.
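To make the idea concrete before reaching for scikit-learn, here is a minimal hand-rolled sketch of a bag of words. The helper names `tokenize` and `to_vector` are hypothetical, and the tokenization rule (lowercase, keep runs of two or more word characters) is an assumption chosen to resemble `CountVectorizer`'s default behaviour.

```python
import re
from collections import Counter

docs = ["it was the best of times", "it was the worst of times"]

def tokenize(text):
    # Assumption: lowercase, then keep runs of 2+ word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

# The vocabulary is the sorted set of all tokens seen in the corpus.
vocab = sorted({tok for doc in docs for tok in tokenize(doc)})

def to_vector(text):
    # Count each token, then read the counts off in vocabulary order.
    counts = Counter(tokenize(text))
    return [counts.get(tok, 0) for tok in vocab]

vectors = [to_vector(doc) for doc in docs]
print(vocab)
print(vectors)

# Position is ignored: a shuffled document maps to the same vector.
print(to_vector("times of best the was it") == vectors[0])
```

Every document becomes a vector with one slot per vocabulary word, regardless of how long the document is or in which order its words appear.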

We will use `CountVectorizer` to **convert text into a matrix of token counts**.

`Bag of Words`: https://machinelearningmastery.com/gentle-introduction-bag-words-model/

`Code Example`: https://machinelearningmastery.com/prepare-text-data-machine-learning-scikit-learn/  

**We will work through the following steps to understand the entire process:**  
a. Converting text to numerical vectors with the help of `CountVectorizer`  
b. Understanding `fit` and `transform`  
c. Looking at `vocabulary_`  
d. Converting a sparse matrix to a dense matrix using `toarray()`  
e. Understanding n-grams via `ngram_range`  

### **Advantages**
1. It is simple to understand and implement, much like one-hot encoding.
2. It gives a fixed-length encoding for a document of any length.
3. Documents that share vocabulary have similar representations: the more words two documents have in common, the closer they are in the vector space, and vice versa.
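The "closer in vector space" claim can be checked with cosine similarity. The sketch below is illustrative only: the three toy documents are made up for this example, and cosine similarity is one reasonable choice of closeness measure, not the only one.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy documents (illustrative): the first two share five of their six words,
# the third shares no words with either of them.
docs = [
    "it was the best of times",
    "it was the worst of times",
    "wisdom and foolishness",
]

X = CountVectorizer().fit_transform(docs)

# Pairwise cosine similarity between the BoW vectors of all documents.
sim = cosine_similarity(X)
print(sim.round(3))
```

Documents 0 and 1 score 5/6 because they share five of six words, while document 2 scores 0 against both because it shares none.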

### **Disadvantages**
1. The size of the vector grows with the size of the vocabulary, so sparsity remains a problem. One way to control it is to limit the vocabulary to the n most frequent words.
2. It does not capture the similarity between different words that mean the same thing, i.e. semantic meaning is not captured.
> a. "walk", "walked", and "walking": the BoW vectors of all three tokens are equally far apart.  
> b. "search" and "explore" are synonyms, but BoW won't capture their semantic similarity.
3. This representation has no way to handle out-of-vocabulary (OOV) words (i.e. new words that were not seen in the corpus used to build the vectorizer).
4. As the name indicates, it is a “bag” of words: word order information is lost. One way to partially recover it is to use n-grams.
5. It suffers from the **curse of dimensionality.**

In [3]:
df.head()

Unnamed: 0,text
0,it Was the best oF Times $
1,It was The worst of times.
2,IT 9 was tHe age Of wisdom
3,it was thE age of foolishness


### **a. BoW Text Vectorization: Apply CountVectorizer**

**Parameters**
1. encoding: str, default='utf-8'
2. decode_error: {'strict', 'ignore', 'replace'}, default='strict' 
    - Instruction on what to do if a byte sequence is given to analyze that contains characters not of the given `encoding`.
3. token_pattern: str or None, default=r"(?u)\b\w\w+\b"
    - Regular expression denoting what constitutes a "token". The default regexp selects tokens of 2 or more alphanumeric characters (punctuation is completely ignored and always treated as a token separator).
    - If there is a capturing group in token_pattern then the captured group content, not the entire match, becomes the token. At most one capturing group is permitted.
4. ngram_range: tuple (min_n, max_n), default=(1, 1)
    - The lower and upper boundary of the range of n-values for different word n-grams or char n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used. For example, an ngram_range of (1, 1) means only unigrams, (1, 2) means unigrams and bigrams, and (2, 2) means only bigrams.
5. strip_accents: {'ascii', 'unicode'}, default=None
    - Remove accents and perform other character normalization during the preprocessing step.
6. lowercase: bool, default=True
    - Convert all characters to lowercase before tokenizing.
7. preprocessor: callable, default=None
    - Override the preprocessing (strip_accents and lowercase) stage while preserving the tokenizing and n-grams generation steps.

In [4]:
# Bag of Words
from sklearn.feature_extraction.text import CountVectorizer

# Initialize the "CountVectorizer" object, which is scikit-learn's
# bag of words tool.
bow_vect = CountVectorizer()

# fit_transform() does two things:
# First, it fits the vectorizer and learns the vocabulary
# Second, it transforms our training data into feature vectors
# The input to fit_transform should be an iterable of strings (e.g. a list or a pandas Series)
dtm = bow_vect.fit_transform(df['text'])

print(f"Shape of DTM (# of docs, # of unique vocabulary): {dtm.shape}")

print(f"Type of DTM (i.e. Compressed Sparse Row (CSR) format): {type(dtm)}")

Shape of DTM (# of docs, # of unique vocabulary): (4, 10)
Type of DTM (i.e. Compressed Sparse Row (CSR) format): <class 'scipy.sparse._csr.csr_matrix'>


In [5]:
dtm

<4x10 sparse matrix of type '<class 'numpy.int64'>'
	with 24 stored elements in Compressed Sparse Row format>

**Remember:**

- `bow_vect.fit(lst_text)` **learns the vocabulary**
- `bow_vect.transform(lst_text)` **uses the fitted vocabulary** to build a **document-term matrix**
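The sketch below separates the two steps to show that `fit` followed by `transform` gives the same matrix as `fit_transform`, and that a fitted vectorizer maps new text onto the same columns. The new sentence and the variable names are made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

lst_text = ['it Was the best oF Times $',
            'It was The worst of times.',
            'IT 9 was tHe age Of wisdom',
            'it was thE age of foolishness']

# Two explicit steps: learn the vocabulary, then build the matrix.
two_step_vect = CountVectorizer()
two_step_vect.fit(lst_text)
dtm_two_step = two_step_vect.transform(lst_text)

# One combined step on a fresh vectorizer.
dtm_one_step = CountVectorizer().fit_transform(lst_text)

same = np.array_equal(dtm_two_step.toarray(), dtm_one_step.toarray())
print("fit + transform equals fit_transform:", same)

# New text is mapped onto the vocabulary learned during fit;
# the number of columns does not change.
new_dtm = two_step_vect.transform(["it was the epoch of belief"])
print(new_dtm.shape)
print(new_dtm.toarray())
```

In practice this means calling `fit` (or `fit_transform`) on training data only, and plain `transform` on validation or test data.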

In [6]:
# We can look at unique words by using 'vocabulary_'

print(f"Vocabulary size: {len(bow_vect.vocabulary_)}")
print()
print(f"Let's look at the vocabulary stored in the object: \n{bow_vect.vocabulary_}")
print()
print("Output Feature Names:", bow_vect.get_feature_names_out())

Vocabulary size: 10

Let's look at the vocabulary stored in the object: 
{'it': 3, 'was': 7, 'the': 5, 'best': 1, 'of': 4, 'times': 6, 'worst': 9, 'age': 0, 'wisdom': 8, 'foolishness': 2}

Output Feature Names: ['age' 'best' 'foolishness' 'it' 'of' 'the' 'times' 'was' 'wisdom' 'worst']


In [7]:
# Since the DTM is sparse, let's convert it into a dense NumPy array.

print(dtm.toarray())

[[0 1 0 1 1 1 1 1 0 0]
 [0 0 0 1 1 1 1 1 0 1]
 [1 0 0 1 1 1 0 1 1 0]
 [1 0 1 1 1 1 0 1 0 0]]


In [8]:
# Printing the CSR matrix lists each non-zero entry as (row, column) followed by its count

print(dtm)

  (0, 3)	1
  (0, 7)	1
  (0, 5)	1
  (0, 1)	1
  (0, 4)	1
  (0, 6)	1
  (1, 3)	1
  (1, 7)	1
  (1, 5)	1
  (1, 4)	1
  (1, 6)	1
  (1, 9)	1
  (2, 3)	1
  (2, 7)	1
  (2, 5)	1
  (2, 4)	1
  (2, 0)	1
  (2, 8)	1
  (3, 3)	1
  (3, 7)	1
  (3, 5)	1
  (3, 4)	1
  (3, 0)	1
  (3, 2)	1
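The printout above shows coordinates, but internally a CSR matrix stores three flat arrays. The sketch below builds a small made-up matrix (not the DTM from this notebook) and inspects them.

```python
import numpy as np
from scipy.sparse import csr_matrix

# A small illustrative matrix with mostly zeros.
dense = np.array([[0, 1, 0, 1],
                  [0, 0, 0, 1],
                  [1, 0, 0, 0]])
sparse = csr_matrix(dense)

# data:    the non-zero values, row by row
# indices: the column index of each stored value
# indptr:  row i owns the slice data[indptr[i]:indptr[i + 1]]
print("data   :", sparse.data)
print("indices:", sparse.indices)
print("indptr :", sparse.indptr)

# Rebuild each row's (column, value) pairs from the three arrays.
for row in range(sparse.shape[0]):
    start, end = sparse.indptr[row], sparse.indptr[row + 1]
    pairs = list(zip(sparse.indices[start:end].tolist(),
                     sparse.data[start:end].tolist()))
    print(f"row {row}: {pairs}")

# Fraction of cells that are actually stored.
density = sparse.nnz / (sparse.shape[0] * sparse.shape[1])
print("density:", density)
```

Only the non-zero entries are stored, which is why sparse formats pay off for real document-term matrices, where most documents use a tiny fraction of the vocabulary.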


In [9]:
# Converting the sparse matrix to a dataframe

pd.DataFrame(dtm.toarray(), 
             columns=bow_vect.get_feature_names_out())

Unnamed: 0,age,best,foolishness,it,of,the,times,was,wisdom,worst
0,0,1,0,1,1,1,1,1,0,0
1,0,0,0,1,1,1,1,1,0,1
2,1,0,0,1,1,1,0,1,1,0
3,1,0,1,1,1,1,0,1,0,0


### **b. BoW Text Vectorization: Apply CountVectorizer with `ngram_range=(1,2)` and `lowercase=False`**

In [10]:
from sklearn.feature_extraction.text import CountVectorizer

bow_vect = CountVectorizer(ngram_range=(1,2), lowercase=False)

dtm = bow_vect.fit_transform(df['text'])

print(f"Shape of DTM (# of docs, # of unique vocabulary): {dtm.shape}")

print(f"Type of DTM (i.e. Compressed Sparse Row (CSR) format): {type(dtm)}")

Shape of DTM (# of docs, # of unique vocabulary): (4, 39)
Type of DTM (i.e. Compressed Sparse Row (CSR) format): <class 'scipy.sparse._csr.csr_matrix'>


In [11]:
dtm

<4x39 sparse matrix of type '<class 'numpy.int64'>'
	with 44 stored elements in Compressed Sparse Row format>

In [12]:
# We can look at unique words by using 'vocabulary_'

print(f"Vocabulary size: {len(bow_vect.vocabulary_)}")
print()
print(f"Let's look at the vocabulary stored in the object: \n{bow_vect.vocabulary_}")
print()
print("Output Feature Names:", bow_vect.get_feature_names_out())

Vocabulary size: 39

Let's look at the vocabulary stored in the object: 
{'it': 17, 'Was': 9, 'the': 29, 'best': 14, 'oF': 20, 'Times': 8, 'it Was': 18, 'Was the': 10, 'the best': 30, 'best oF': 15, 'oF Times': 21, 'It': 2, 'was': 32, 'The': 6, 'worst': 37, 'of': 22, 'times': 31, 'It was': 3, 'was The': 33, 'The worst': 7, 'worst of': 38, 'of times': 24, 'IT': 0, 'tHe': 25, 'age': 11, 'Of': 4, 'wisdom': 36, 'IT was': 1, 'was tHe': 34, 'tHe age': 26, 'age Of': 12, 'Of wisdom': 5, 'thE': 27, 'foolishness': 16, 'it was': 19, 'was thE': 35, 'thE age': 28, 'age of': 13, 'of foolishness': 23}

Output Feature Names: ['IT' 'IT was' 'It' 'It was' 'Of' 'Of wisdom' 'The' 'The worst' 'Times'
 'Was' 'Was the' 'age' 'age Of' 'age of' 'best' 'best oF' 'foolishness'
 'it' 'it Was' 'it was' 'oF' 'oF Times' 'of' 'of foolishness' 'of times'
 'tHe' 'tHe age' 'thE' 'thE age' 'the' 'the best' 'times' 'was' 'was The'
 'was tHe' 'was thE' 'wisdom' 'worst' 'worst of']
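To see where the 39 features come from, it helps to generate word n-grams by hand. The helper `word_ngrams` below is a hypothetical illustration, and the comparison against `build_analyzer()` is only a sanity check on the first sentence.

```python
from sklearn.feature_extraction.text import CountVectorizer

def word_ngrams(tokens, min_n, max_n):
    # Slide a window of each size n over the token list and join with spaces.
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

# Tokens of the first document ("$" is not a token under the default pattern).
tokens = ["it", "Was", "the", "best", "oF", "Times"]
grams = word_ngrams(tokens, 1, 2)
print(grams)
print(len(grams))  # 6 unigrams + 5 bigrams

# Compare with what CountVectorizer itself extracts from the raw string.
analyzer = CountVectorizer(ngram_range=(1, 2), lowercase=False).build_analyzer()
sk_grams = analyzer("it Was the best oF Times $")
print(sorted(grams) == sorted(sk_grams))
```

Each of the four documents yields 11 n-grams in this way (44 stored elements in total), and because `lowercase=False` keeps variants such as `It`, `IT` and `it` apart, 39 of them are distinct.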


In [13]:
# convert sparse matrix to numpy array
print(dtm.toarray())

[[0 0 0 0 0 0 0 0 1 1 1 0 0 0 1 1 0 1 1 0 1 1 0 0 0 0 0 0 0 1 1 0 0 0 0 0
  0 0 0]
 [0 0 1 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 1 1 1 0 0
  0 1 1]
 [1 1 0 0 1 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 1 0 1 0
  1 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 1 1 0 1 0 0 1 1 0 0 0 1 1 0 0 0 1 0 0 1
  0 0 0]]


In [14]:
# Converting the sparse matrix to a dataframe

pd.DataFrame(dtm.toarray(), 
             columns=bow_vect.get_feature_names_out())

Unnamed: 0,IT,IT was,It,It was,Of,Of wisdom,The,The worst,Times,Was,...,the,the best,times,was,was The,was tHe,was thE,wisdom,worst,worst of
0,0,0,0,0,0,0,0,0,1,1,...,1,1,0,0,0,0,0,0,0,0
1,0,0,1,1,0,0,1,1,0,0,...,0,0,1,1,1,0,0,0,1,1
2,1,1,0,0,1,1,0,0,0,0,...,0,0,0,1,0,1,0,1,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,1,0,0,0


## **Term Frequency - Inverse Document Frequency (TF IDF)**

In the BoW approach, all words in the text are treated as equally important, i.e. there is no notion of some words in a document being more important than others. TF-IDF (term frequency-inverse document frequency) addresses this issue. It aims to quantify the importance of a given word relative to the other words in the document and in the corpus.

***

Let's now try to understand:
1. Term Frequency  
2. Inverse Document Frequency

$$ \text{TF-IDF}(word_i, doc_j) = TF(word_i, doc_j) \times IDF(word_i, corpus) $$

$$ TF(word_i, doc_j) = \frac{\text{Number of times } word_i \text{ occurs in } doc_j}{\text{Total number of words in } doc_j} $$

$$ IDF(word_i, corpus) = \log\left(\frac{\text{Number of docs in corpus}}{\text{Number of docs that contain } word_i}\right) $$

***
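The formulas above can be evaluated directly on the four sample sentences. This is a minimal sketch, assuming the natural logarithm and pre-tokenized, lowercased documents; the function names are hypothetical, and `idf` assumes the word occurs in at least one document.

```python
import math

# The four sample sentences, already lowercased and tokenized.
corpus = [
    ["it", "was", "the", "best", "of", "times"],
    ["it", "was", "the", "worst", "of", "times"],
    ["it", "was", "the", "age", "of", "wisdom"],
    ["it", "was", "the", "age", "of", "foolishness"],
]

def tf(word, doc):
    # Share of the document's tokens that are this word.
    return doc.count(word) / len(doc)

def idf(word, docs):
    # Assumes the word appears in at least one document (no zero division).
    n_containing = sum(1 for doc in docs if word in doc)
    return math.log(len(docs) / n_containing)

def tf_idf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)

for word in ["it", "times", "best"]:
    print(word, round(tf_idf(word, corpus[0], corpus), 4))
```

A word that appears in every document ("it") gets a weight of exactly 0, a word found in two documents ("times") gets a moderate weight, and a word unique to one document ("best") gets the highest weight.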

### **Advantages**
1. If the word is rare in the corpus, it will be given more importance. (i.e. IDF)
2. If the word is more frequent in a document, it will be given more importance. (i.e. TF)

### **Disadvantages**
> **Same as BOW**

### **TF IDF Text Vectorization: Apply TfidfVectorizer**

In [15]:
# TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vect = TfidfVectorizer()

out = tfidf_vect.fit_transform(df['text'])

print(f"Shape of output (# of docs, # of unique vocabulary): {out.shape}")

print(f"Type of output (i.e. Compressed Sparse Row (CSR) format): {type(out)}")

Shape of output (# of docs, # of unique vocabulary): (4, 10)
Type of output (i.e. Compressed Sparse Row (CSR) format): <class 'scipy.sparse._csr.csr_matrix'>


In [16]:
out

<4x10 sparse matrix of type '<class 'numpy.float64'>'
	with 24 stored elements in Compressed Sparse Row format>

In [17]:
# We can look at unique words by using 'vocabulary_'

print(f"Vocabulary size: {len(tfidf_vect.vocabulary_)}")
print()
print(f"Let's look at the vocabulary stored in the object: \n{tfidf_vect.vocabulary_}")
print()
print("Output Feature Names:", tfidf_vect.get_feature_names_out())

Vocabulary size: 10

Let's look at the vocabulary stored in the object: 
{'it': 3, 'was': 7, 'the': 5, 'best': 1, 'of': 4, 'times': 6, 'worst': 9, 'age': 0, 'wisdom': 8, 'foolishness': 2}

Output Feature Names: ['age' 'best' 'foolishness' 'it' 'of' 'the' 'times' 'was' 'wisdom' 'worst']


In [18]:
# convert sparse matrix to a NumPy array

print(out.toarray())

[[0.         0.60735961 0.         0.31694544 0.31694544 0.31694544
  0.4788493  0.31694544 0.         0.        ]
 [0.         0.         0.         0.31694544 0.31694544 0.31694544
  0.4788493  0.31694544 0.         0.60735961]
 [0.4788493  0.         0.         0.31694544 0.31694544 0.31694544
  0.         0.31694544 0.60735961 0.        ]
 [0.4788493  0.         0.60735961 0.31694544 0.31694544 0.31694544
  0.         0.31694544 0.         0.        ]]


In [19]:
# Converting the sparse matrix to a dataframe

pd.DataFrame(out.toarray(), 
             columns=tfidf_vect.get_feature_names_out())

Unnamed: 0,age,best,foolishness,it,of,the,times,was,wisdom,worst
0,0.0,0.60736,0.0,0.316945,0.316945,0.316945,0.478849,0.316945,0.0,0.0
1,0.0,0.0,0.0,0.316945,0.316945,0.316945,0.478849,0.316945,0.0,0.60736
2,0.478849,0.0,0.0,0.316945,0.316945,0.316945,0.0,0.316945,0.60736,0.0
3,0.478849,0.0,0.60736,0.316945,0.316945,0.316945,0.0,0.316945,0.0,0.0
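The values above do not match the textbook formula exactly: with its default settings (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`), `TfidfVectorizer` uses raw counts as TF, a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, and then scales each row to unit L2 length. The sketch below reproduces the output by hand under those defaults; the variable names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

lst_text = ['it Was the best oF Times $',
            'It was The worst of times.',
            'IT 9 was tHe age Of wisdom',
            'it was thE age of foolishness']

# Step 1: raw term counts (this is the TF used by scikit-learn by default).
counts = CountVectorizer().fit_transform(lst_text).toarray().astype(float)

# Step 2: smoothed IDF, ln((1 + n) / (1 + df)) + 1.
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Step 3: weight the counts, then L2-normalize each document row.
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

# Compare with TfidfVectorizer's own output.
reference = TfidfVectorizer().fit_transform(lst_text).toarray()
print(np.allclose(manual, reference))
print(manual[0].round(6))
```

Because of the `+ 1` in the smoothed IDF, words that occur in every document (`it`, `of`, `the`, `was`) keep a non-zero weight of about 0.317 instead of dropping to 0.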
