# **Working with Text Data - Text Transformation (AKA Vectorization)**
**AKA Feature Extraction | Engineering | Transformation (Text to Numerical Vector)**

### **Text Data**

Text Analysis is a major application field for machine learning algorithms. Some of the major application areas of NLP are:
1. Spell Checker, Keyword Search, etc
2. Sentiment Analysis, Spam Classification
3. Machine Translation
4. Chatbots/Dialog Systems
5. Question Answering Systems
etc..

However, raw text is a sequence of symbols and cannot be fed directly to most algorithms, which expect fixed-size numerical feature vectors rather than variable-length text documents.

### **Why is NLP hard?**
1. Complexity of representation
> Poems, Sarcasm, etc...  
> Example 1: This task is a piece of cake.  
> Example 2: You have a football game tomorrow. Break a leg!

2. Ambiguity in Natural Language
> Ambiguity means uncertainty of meaning.  
> For Example: The car hit the pole while it was moving.

### **Vectorization Techniques (AKA Feature Engineering or Extraction - Convert Text to Numerical Vectors)**

1. Bag of Words
2. TF IDF (Term Frequency - Inverse Document Frequency)
3. Word2Vec (by Google)
4. GloVe (Global Vectors by Stanford)
5. FastText (by Facebook)
6. ELMo (Embeddings from Language Models)
7. GPT (Generative Pre-trained Transformer by OpenAI)
8. BERT (Bidirectional Encoder Representations from Transformer by Google)
9. LLMs

**Only the following techniques are covered in this notebook:**
1. Bag of Words
2. TF IDF (Term Frequency - Inverse Document Frequency)

## **Bag of Words Representation**

We call vectorization the general process of turning a collection of text documents into numerical feature vectors. This specific strategy (tokenization, counting and normalization) is called the Bag of Words or "Bag of n-grams" representation. Documents are described by word occurrences while completely ignoring the relative position information of the words in the document.
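
To make the idea concrete, here is a minimal hand-rolled sketch of a bag of words in plain Python. The two sample sentences and the whitespace tokenizer are assumptions made for illustration; `CountVectorizer` uses a regular-expression tokenizer and returns a sparse matrix instead of nested lists.

```python
from collections import Counter

# Two toy documents (illustrative data, not from the notebook's corpus)
docs = ["the cat sat on the mat", "the dog sat"]

# Tokenize on whitespace (a simplification of CountVectorizer's regex tokenizer)
tokenized = [doc.split() for doc in docs]

# Vocabulary: every unique token, sorted, mapped to a column index
unique_tokens = sorted({tok for tokens in tokenized for tok in tokens})
vocab = {tok: idx for idx, tok in enumerate(unique_tokens)}

# One fixed-length count vector per document; word order is discarded
vectors = []
for tokens in tokenized:
    counts = Counter(tokens)
    vectors.append([counts.get(tok, 0) for tok in vocab])

print(vocab)    # {'cat': 0, 'dog': 1, 'mat': 2, 'on': 3, 'sat': 4, 'the': 5}
print(vectors)  # [[1, 0, 1, 1, 1, 2], [0, 1, 0, 0, 1, 1]]
```

Both documents end up with vectors of the same length (the vocabulary size), regardless of how many words each one contains.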



We will use `CountVectorizer` to **convert text into a matrix of token counts**.

**We will go through the following steps to understand the entire process:**  
a. Converting text to numerical vectors with the help of `CountVectorizer`  
b. Understanding `fit` and `transform`  
c. Looking at `vocabulary_`  
d. Converting the sparse matrix to a dense matrix using `toarray()`  
e. Understanding n-grams (`ngram_range`)  

### **Advantages**
1. It is simple to understand and implement like OneHotEncoding.
2. We have a fixed length encoding for any sequence of arbitrary length.
3. Documents that share the same words will have similar representations. So if two documents have a similar vocabulary, they’ll be closer to each other in the vector space, and vice versa.
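
The third advantage can be checked with a small sketch. The three sentences below are made up for illustration, and cosine similarity is just one common choice of closeness measure in vector space.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Illustrative documents: the first two share most of their vocabulary
docs = [
    "machine learning with text data",
    "text data for machine learning",
    "football game tomorrow",
]

X = CountVectorizer().fit_transform(docs)

# Pairwise cosine similarity between the document count vectors
sim = cosine_similarity(X)
print(sim.round(2))
```

Documents 0 and 1 share four of their five words, so their similarity is high (0.8), while document 2 shares no words with either and scores 0.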

### **Disadvantages**
1. The size of vector increases with the size of the vocabulary. Thus, sparsity continues to be a problem. One way to control it is by limiting the vocabulary to n number of the most frequent words.
2. It does not capture the similarity between different words that mean the same thing. i.e. Semantic Meaning is not captured.
> a. "walk", "walked", and "walking". BoW vectors of all three tokens will be equally apart.  
> b. "search" and "explore" are synonyms. BoW won't capture the semantic similarity of these words.
3. This representation does not have any way to handle out of vocabulary (OOV) words (i.e., new words that were not seen in the corpus that was used to build the vectorizer).
4. As the name indicates, it is a “bag” of words. Word order information is lost in this representation. One way to partly recover it is to use n-grams.
5. It suffers from the **curse of dimensionality.**

### **a. BoW Text Vectorization: Apply CountVectorizer**

**Parameters**
1. encoding: str, default='utf-8'
2. decode_error: {'strict', 'ignore', 'replace'}, default='strict' 
    - Instruction on what to do if a byte sequence is given to analyze that contains characters not of the given `encoding`.
3. token_pattern: str or None, default=r"(?u)\b\w\w+\b"
    - Regular expression denoting what constitutes a "token". The default regexp selects tokens of 2 or more alphanumeric characters (punctuation is completely ignored and always treated as a token separator). Let's break down this default pattern:
        - **(?u)**: This flag makes the pattern Unicode aware.
        - **\b**: Word boundary. It prevents partial matching of pattern in long sequences.
        - **\w\w+**: Match two or more consecutive word characters (\w means a single word character, and \w+ means one or more word characters). Because the pattern is Unicode aware, \w matches letters and digits from any script plus the underscore, not only [a-zA-Z0-9_].
        - **\b**: Another word boundary.
    - If there is a capturing group in token_pattern then the captured group content, not the entire match, becomes the token. At most one capturing group is permitted.
4. tokenizer: callable, default=None
    - Override the string tokenization step while preserving the preprocessing and n-grams generation steps.
5. ngram_range: tuple (min_n, max_n), default=(1, 1)
    - The lower and upper boundary of the range of n-values for different word n-grams or char n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used. For example, an ngram_range of (1, 1) means only unigrams, (1, 2) means unigrams and bigrams, and (2, 2) means only bigrams.
6. strip_accents: {'ascii', 'unicode'}, default=None
    - Remove accents and perform other character normalization during the preprocessing step.
7. lowercase: bool, default=True
    - Convert all characters to lowercase before tokenizing.
8. preprocessor: callable, default=None
    - Override the preprocessing (strip_accents and lowercase) stage while preserving the tokenizing and n-grams generation steps.
9. stop_words: {'english'}, a list, default=None
    - 'english': Use a built-in list of English stop words.
    - A list: Provide your own list of stop words.
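
The behaviour of `token_pattern` can be explored with Python's `re` module alone. This is a sketch on a made-up sentence; it lowercases by hand to mimic the default `lowercase=True` step that runs before tokenization.

```python
import re

# Default CountVectorizer pattern: two or more word characters
default_pattern = r"(?u)\b\w\w+\b"
# Relaxed pattern that also keeps single-character tokens
single_char_pattern = r"(?u)\b\w+\b"

# Illustrative sentence with a one-letter word, digits, symbols and an accent
sample = "This is a test, with numbers 123 and symbols #! café"

tokens_default = re.findall(default_pattern, sample.lower())
tokens_single = re.findall(single_char_pattern, sample.lower())

print(tokens_default)
# ['this', 'is', 'test', 'with', 'numbers', '123', 'and', 'symbols', 'café']
print(tokens_single)
# ['this', 'is', 'a', 'test', 'with', 'numbers', '123', 'and', 'symbols', 'café']
```

The only difference between the two results is the one-letter word "a"; punctuation and symbols never appear as tokens under either pattern.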

In [2]:
# Import the CountVectorizer i.e. Bag of Words
from sklearn.feature_extraction.text import CountVectorizer

text = ["This is a test", "with numbers 123", "and symbols #!"]

# initialize the object
vectorizer = CountVectorizer()

# Fit and transform
num_rep = vectorizer.fit_transform(text)

print("Shape:", num_rep.shape)
print("Type of Numerical Representation:", type(num_rep))
print("Vocabulary learned:", vectorizer.get_feature_names_out())

Shape: (3, 8)
Type of Numerical Representation: <class 'scipy.sparse._csr.csr_matrix'>
Vocabulary learned: ['123' 'and' 'is' 'numbers' 'symbols' 'test' 'this' 'with']


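#### **Important Observations:**  
1. With the default settings, `CountVectorizer` drops special characters: symbols such as `#!` never become tokens.
2. Only runs of two or more word characters (letters, digits, underscore) become tokens, so the single-character word "a" is dropped.
3. Text is lowercased before tokenizing, so "This" is stored as "this".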

In [3]:
import pandas as pd

lst_text = ["We are Learning Machine Learning $", 
            "Processing natural - language data.", 
            "10 machine - learning algorithms.", 
            "we Are Mimicing natural intelligence"]

df = pd.DataFrame({'text': lst_text})

df.head()

Unnamed: 0,text
0,We are Learning Machine Learning $
1,Processing natural - language data.
2,10 machine - learning algorithms.
3,we Are Mimicing natural intelligence


In [4]:
# In the next section: Custom Text Cleaning
# we will study a problem with this approach
def tokenizer(doc):
    return doc.split()

In [5]:
# Bag of Words
from sklearn.feature_extraction.text import CountVectorizer

# Initialize the "CountVectorizer" object, which is scikit-learn's
# bag of words tool.
bow_vect = CountVectorizer(token_pattern=None,
                           tokenizer=tokenizer,
                           ngram_range=(1, 1), 
                           lowercase=False, 
                           preprocessor=None, 
                           stop_words=None)

# fit_transform() does two functions: 
# First, it fits and learns the vocabulary 
# Second, it transforms our training data into feature vectors
# The input to fit_transform should be a list of strings
dtm = bow_vect.fit_transform(df['text'])

print(f"Shape of DTM (# of docs, # of unique vocabulary): {dtm.shape}")

print(f"Type of DTM (i.e. Compressed Sparse Row (CSR) format): {type(dtm)}")

Shape of DTM (# of docs, # of unique vocabulary): (4, 18)
Type of DTM (i.e. Compressed Sparse Row (CSR) format): <class 'scipy.sparse._csr.csr_matrix'>


In [6]:
dtm

<4x18 sparse matrix of type '<class 'numpy.int64'>'
	with 20 stored elements in Compressed Sparse Row format>

#### **A note on Sparsity**
As most documents will typically use a very small subset of the words used in the corpus, the resulting matrix will have many feature values that are zeros (typically more than 99% of them).

For instance a collection of 10,000 short text documents (such as emails) will use a vocabulary with a size in the order of 100,000 unique words in total while each document will use 100 to 1000 unique words individually.

To store such a matrix in memory and to speed up matrix/vector algebraic operations, implementations typically use a sparse representation such as those available in the `scipy.sparse` package.
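
As a rough illustration, sparsity can be measured directly from the `nnz` attribute of a SciPy sparse matrix. The sketch below rebuilds the notebook's four-document matrix, using `str.split` as a stand-in for the whitespace tokenizer defined earlier.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["We are Learning Machine Learning $",
        "Processing natural - language data.",
        "10 machine - learning algorithms.",
        "we Are Mimicing natural intelligence"]

# Whitespace tokenization, no lowercasing (str.split stands in for tokenizer())
vect = CountVectorizer(token_pattern=None, tokenizer=str.split, lowercase=False)
X = vect.fit_transform(docs)

n_cells = X.shape[0] * X.shape[1]  # cells a dense matrix would need
density = X.nnz / n_cells          # fraction of cells actually stored

print(f"Stored non-zeros: {X.nnz} of {n_cells} cells")
print(f"Sparsity: {1 - density:.1%}")
```

Even on this tiny corpus most cells are zero; on realistic corpora the zero fraction grows far larger, which is why the sparse format matters.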

In [7]:
# We can look at unique words by using 'vocabulary_'

print(f"Vocabulary size: {len(bow_vect.vocabulary_)}")
print()
print(f"Let's look at the vocabulary stored in the object: \n{bow_vect.vocabulary_}")
print()
print("Output Feature Names:", bow_vect.get_feature_names_out())

Vocabulary size: 18

Let's look at the vocabulary stored in the object: 
{'We': 8, 'are': 10, 'Learning': 4, 'Machine': 5, '$': 0, 'Processing': 7, 'natural': 16, '-': 1, 'language': 13, 'data.': 11, '10': 2, 'machine': 15, 'learning': 14, 'algorithms.': 9, 'we': 17, 'Are': 3, 'Mimicing': 6, 'intelligence': 12}

Output Feature Names: ['$' '-' '10' 'Are' 'Learning' 'Machine' 'Mimicing' 'Processing' 'We'
 'algorithms.' 'are' 'data.' 'intelligence' 'language' 'learning'
 'machine' 'natural' 'we']


In [8]:
# Since the dtm is sparse, let's convert it into a NumPy array

print(dtm.toarray())

[[1 0 0 0 2 1 0 0 1 0 1 0 0 0 0 0 0 0]
 [0 1 0 0 0 0 0 1 0 0 0 1 0 1 0 0 1 0]
 [0 1 1 0 0 0 0 0 0 1 0 0 0 0 1 1 0 0]
 [0 0 0 1 0 0 1 0 0 0 0 0 1 0 0 0 1 1]]


In [9]:
# Let's look at the Compressed Sparse Row (CSR) format

print(dtm)

  (0, 8)	1
  (0, 10)	1
  (0, 4)	2
  (0, 5)	1
  (0, 0)	1
  (1, 7)	1
  (1, 16)	1
  (1, 1)	1
  (1, 13)	1
  (1, 11)	1
  (2, 1)	1
  (2, 2)	1
  (2, 15)	1
  (2, 14)	1
  (2, 9)	1
  (3, 16)	1
  (3, 17)	1
  (3, 3)	1
  (3, 6)	1
  (3, 12)	1


In [10]:
# Converting the sparse matrix to a dataframe

pd.DataFrame(dtm.toarray(), 
             columns=bow_vect.get_feature_names_out())

Unnamed: 0,$,-,10,Are,Learning,Machine,Mimicing,Processing,We,algorithms.,are,data.,intelligence,language,learning,machine,natural,we
0,1,0,0,0,2,1,0,0,1,0,1,0,0,0,0,0,0,0
1,0,1,0,0,0,0,0,1,0,0,0,1,0,1,0,0,1,0
2,0,1,1,0,0,0,0,0,0,1,0,0,0,0,1,1,0,0
3,0,0,0,1,0,0,1,0,0,0,0,0,1,0,0,0,1,1


### **b. BoW Text Vectorization: Apply CountVectorizer with `ngram_range=(1,2)` and `lowercase=False`**

In [11]:
from sklearn.feature_extraction.text import CountVectorizer

bow_vect = CountVectorizer(token_pattern=None,
                           tokenizer=tokenizer,
                           ngram_range=(1, 2), 
                           lowercase=False, 
                           preprocessor=None, 
                           stop_words=None)

dtm = bow_vect.fit_transform(df['text'])

print(f"Shape of DTM (# of docs, # of unique vocabulary): {dtm.shape}")

print(f"Type of DTM (i.e. Compressed Sparse Row (CSR) format): {type(dtm)}")

Shape of DTM (# of docs, # of unique vocabulary): (4, 35)
Type of DTM (i.e. Compressed Sparse Row (CSR) format): <class 'scipy.sparse._csr.csr_matrix'>


In [12]:
dtm

<4x35 sparse matrix of type '<class 'numpy.int64'>'
	with 37 stored elements in Compressed Sparse Row format>

In [13]:
# We can look at unique words by using 'vocabulary_'

print(f"Vocabulary size: {len(bow_vect.vocabulary_)}")
print()
print(f"Let's look at the vocabulary stored in the object: \n{bow_vect.vocabulary_}")
print()
print("Output Feature Names:", bow_vect.get_feature_names_out())

Vocabulary size: 35

Let's look at the vocabulary stored in the object: 
{'We': 17, 'are': 20, 'Learning': 8, 'Machine': 11, '$': 0, 'We are': 18, 'are Learning': 21, 'Learning Machine': 10, 'Machine Learning': 12, 'Learning $': 9, 'Processing': 15, 'natural': 30, '-': 1, 'language': 24, 'data.': 22, 'Processing natural': 16, 'natural -': 31, '- language': 2, 'language data.': 25, '10': 4, 'machine': 28, 'learning': 26, 'algorithms.': 19, '10 machine': 5, 'machine -': 29, '- learning': 3, 'learning algorithms.': 27, 'we': 33, 'Are': 6, 'Mimicing': 13, 'intelligence': 23, 'we Are': 34, 'Are Mimicing': 7, 'Mimicing natural': 14, 'natural intelligence': 32}

Output Feature Names: ['$' '-' '- language' '- learning' '10' '10 machine' 'Are' 'Are Mimicing'
 'Learning' 'Learning $' 'Learning Machine' 'Machine' 'Machine Learning'
 'Mimicing' 'Mimicing natural' 'Processing' 'Processing natural' 'We'
 'We are' 'algorithms.' 'are' 'are Learning' 'data.' 'intelligence'
 'language' 'language data.' 'learning' 'learning algorithms.' 'machine'
 'machine -' 'natural' 'natural -' 'natural intelligence' 'we' 'we Are']

In [14]:
# convert sparse matrix to numpy array
print(dtm.toarray())

[[1 0 0 0 0 0 0 0 2 1 1 1 1 0 0 0 0 1 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 1 0 1 1 0 0 0 0 1 1 0 0 0]
 [0 1 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0]
 [0 0 0 0 0 0 1 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 1 1 1]]


In [15]:
# Converting the sparse matrix to a dataframe

pd.DataFrame(dtm.toarray(), 
             columns=bow_vect.get_feature_names_out())

Unnamed: 0,$,-,- language,- learning,10,10 machine,Are,Are Mimicing,Learning,Learning $,...,language data.,learning,learning algorithms.,machine,machine -,natural,natural -,natural intelligence,we,we Are
0,1,0,0,0,0,0,0,0,2,1,...,0,0,0,0,0,0,0,0,0,0
1,0,1,1,0,0,0,0,0,0,0,...,1,0,0,0,0,1,1,0,0,0
2,0,1,0,1,1,1,0,0,0,0,...,0,1,1,1,1,0,0,0,0,0
3,0,0,0,0,0,0,1,1,0,0,...,0,0,0,0,0,1,0,1,1,1
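
The unigram and bigram features in this section can be reproduced by hand. The helper below is a hypothetical illustration (the name `word_ngrams` is not part of scikit-learn's public API) of what `ngram_range=(1, 2)` does to one tokenized document.

```python
def word_ngrams(tokens, min_n=1, max_n=2):
    """Return all contiguous n-grams with min_n <= n <= max_n, joined by spaces."""
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n across the token list
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

# First document of the notebook's corpus, tokenized on whitespace
tokens = "We are Learning Machine Learning $".split()
grams = word_ngrams(tokens)

print(grams)
print(f"{len(grams)} n-grams, {len(set(grams))} unique")
```

The 10 unique n-grams correspond to the 10 non-zero columns in the first row of the matrix, and the repeated unigram "Learning" explains the count of 2.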


## **Term Frequency - Inverse Document Frequency (TF IDF)**

In the BoW approach, all words in a text are treated as equally important; there is no notion of some words in a document mattering more than others. TF-IDF (term frequency-inverse document frequency) addresses this issue by quantifying the importance of a given word relative to the other words in the document and in the corpus.

***

$$ TF \ IDF = TF(word_i, doc_j) * IDF(word_i, corpus) $$

***

Let's now try to understand:
1. **Term Frequency**
    - Measures how frequently a term (word) appears in a document.

2. **Inverse Document Frequency**
    - Measures how important a term is within the entire corpus.
    - It decreases the weight of terms that appear in many documents and increases the weight of terms that appear in fewer documents.

***

$$ TF(word_i, doc_j) = \frac{No \ of \ time \ word_i \ occurs \ in \ doc_j}{Total \ no \ of \ words \ in \ doc_j} $$

$$ IDF(word_i, corpus) = \log(\frac{No \ of \ docs \ in \ corpus}{No \ of \ docs \ which \ contains \ word_i}) + 1 $$

**Note:** 
1. Adding "1" to the IDF in the equation above means that terms whose log term is zero, i.e., terms that occur in all documents in a training set, will not be entirely ignored.
2. Also note that the IDF formula above differs from the standard textbook notation, which defines the IDF as idf(t) = log [ n / (df(t) + 1) ].
3. In general, the log can be base `e`, `2`, `10`, etc. scikit-learn uses the natural log.

**To avoid zero division, 1 is added to both the numerator and the denominator (the default in `TfidfVectorizer`, `smooth_idf=True`):**
$$ IDF(word_i, corpus) = \log(\frac{1 \ + \ No \ of \ docs \ in \ corpus}{1 \ + \ No \ of \ docs \ which \ contains \ word_i}) + 1 $$
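
As a worked check of the smoothed formula, the sketch below computes TF-IDF weights by hand for the second document of the notebook's corpus. It assumes scikit-learn's default behaviour: the raw count is used as TF and each row is then L2-normalized (`norm='l2'`). Dividing by the document length, as in the TF formula above, would scale the whole row by a constant and therefore cancel out in the normalization.

```python
import numpy as np

# The notebook's corpus, tokenized the way r"(?u)\b\w+\b" would with lowercase=False
docs = [
    ["We", "are", "Learning", "Machine", "Learning"],
    ["Processing", "natural", "language", "data"],
    ["10", "machine", "learning", "algorithms"],
    ["we", "Are", "Mimicing", "natural", "intelligence"],
]
n_docs = len(docs)

def smooth_idf(term):
    """log((1 + n_docs) / (1 + df)) + 1, using the natural log."""
    df = sum(term in doc for doc in docs)
    return np.log((1 + n_docs) / (1 + df)) + 1

doc = docs[1]
terms = sorted(set(doc))

# Raw count times IDF, then scale the row to unit Euclidean length
raw = np.array([doc.count(t) * smooth_idf(t) for t in terms])
weights = raw / np.linalg.norm(raw)

for term, weight in zip(terms, weights):
    print(f"{term}: {weight:.6f}")
```

"natural" appears in two documents, so it gets a lower weight (about 0.414) than the three terms that appear in only one document (about 0.525 each).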

***

### **Advantages**
1. If the word is rare in the corpus, it will be given more importance. (i.e. IDF)
2. If the word is more frequent in a document, it will be given more importance. (i.e. TF)

### **Disadvantages**
> **Same as BOW**

### **TF IDF Text Vectorization: Apply TfidfVectorizer**

In [16]:
# TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vect = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b", 
                             ngram_range=(1, 1), 
                             lowercase=False, 
                             preprocessor=None, 
                             stop_words=None)

out = tfidf_vect.fit_transform(df['text'])

print(f"Shape of output (# of docs, # of unique vocabulary): {out.shape}")

print(f"Type of output (i.e. Compressed Sparse Row (CSR) format): {type(out)}")

Shape of output (# of docs, # of unique vocabulary): (4, 16)
Type of output (i.e. Compressed Sparse Row (CSR) format): <class 'scipy.sparse._csr.csr_matrix'>


In [17]:
out

<4x16 sparse matrix of type '<class 'numpy.float64'>'
	with 17 stored elements in Compressed Sparse Row format>

In [18]:
# We can look at unique words by using 'vocabulary_'

print(f"Vocabulary size: {len(tfidf_vect.vocabulary_)}")
print()
print(f"Let's look at the vocabulary stored in the object: \n{tfidf_vect.vocabulary_}")
print()
print("Output Feature Names:", tfidf_vect.get_feature_names_out())

Vocabulary size: 16

Let's look at the vocabulary stored in the object: 
{'We': 6, 'are': 8, 'Learning': 2, 'Machine': 3, 'Processing': 5, 'natural': 14, 'language': 11, 'data': 9, '10': 0, 'machine': 13, 'learning': 12, 'algorithms': 7, 'we': 15, 'Are': 1, 'Mimicing': 4, 'intelligence': 10}

Output Feature Names: ['10' 'Are' 'Learning' 'Machine' 'Mimicing' 'Processing' 'We' 'algorithms'
 'are' 'data' 'intelligence' 'language' 'learning' 'machine' 'natural'
 'we']


In [19]:
# convert sparse matrix to NumPy array

print(out.toarray())

[[0.         0.         0.75592895 0.37796447 0.         0.
  0.37796447 0.         0.37796447 0.         0.         0.
  0.         0.         0.         0.        ]
 [0.         0.         0.         0.         0.         0.52547275
  0.         0.         0.         0.52547275 0.         0.52547275
  0.         0.         0.41428875 0.        ]
 [0.5        0.         0.         0.         0.         0.
  0.         0.5        0.         0.         0.         0.
  0.5        0.5        0.         0.        ]
 [0.         0.46516193 0.         0.         0.46516193 0.
  0.         0.         0.         0.         0.46516193 0.
  0.         0.         0.36673901 0.46516193]]


In [20]:
# Converting the sparse matrix to a dataframe

pd.DataFrame(out.toarray(), 
             columns=tfidf_vect.get_feature_names_out())

Unnamed: 0,10,Are,Learning,Machine,Mimicing,Processing,We,algorithms,are,data,intelligence,language,learning,machine,natural,we
0,0.0,0.0,0.755929,0.377964,0.0,0.0,0.377964,0.0,0.377964,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.525473,0.0,0.0,0.0,0.525473,0.0,0.525473,0.0,0.0,0.414289,0.0
2,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.5,0.5,0.0,0.0
3,0.0,0.465162,0.0,0.0,0.465162,0.0,0.0,0.0,0.0,0.0,0.465162,0.0,0.0,0.0,0.366739,0.465162


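## **End Note - How do `CountVectorizer` and `TfidfVectorizer` work?**

1. Apply **preprocessing** (by default, lowercasing and optional accent stripping; or a custom `preprocessor` if one is defined). This cleans the raw text.
2. Tokenize the cleaned text according to **token_pattern** (or with a custom `tokenizer`, if one is given).
3. Build n-grams from the tokens and learn the unique vocabulary.
4. Count how often each vocabulary term occurs in each document. `TfidfVectorizer` additionally reweights these counts by IDF.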
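
These stages can be inspected one at a time through the `build_preprocessor`, `build_tokenizer` and `build_analyzer` methods of the vectorizer. The sketch below uses a default `CountVectorizer` and the first sentence of the notebook's corpus.

```python
from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer()  # default settings: lowercase=True, unigrams only

preprocess = vect.build_preprocessor()  # step 1: cleaning (lowercasing here)
tokenize = vect.build_tokenizer()       # step 2: split according to token_pattern
analyze = vect.build_analyzer()         # steps 1 and 2 plus n-gram generation

doc = "We are Learning Machine Learning $"

cleaned = preprocess(doc)
tokens = tokenize(cleaned)
analyzed = analyze(doc)

print(cleaned)   # we are learning machine learning $
print(tokens)    # ['we', 'are', 'learning', 'machine', 'learning']
print(analyzed)  # same as tokens, because only unigrams are requested
```

Calling these helpers on a few sample documents is a quick way to debug a custom `preprocessor`, `tokenizer` or `token_pattern` before fitting on a full corpus.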