# Feature Extraction

To feed text to a model, we first need to transform it into numerical features. In this notebook we will discuss how to build a bag-of-words model from text so that we can use it later in different applications.

# Bag of words

Count the occurrences of each word in every document of the corpus.
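As a rough illustration of the idea, here is a minimal sketch that builds the counts by hand using only the standard library. It assumes plain whitespace tokenization (no lowercasing, punctuation handling, or stop-word removal), and the names `vocabulary` and `rows` are chosen just for this example.

```python
from collections import Counter

texts = [
    'the red dog dog',
    'cat eats dog dog',
    'dog eats food',
    'red cat eats',
    'the hot dog',
]

# Vocabulary: every distinct word in the corpus, sorted alphabetically
vocabulary = sorted({word for text in texts for word in text.split()})

# One row per document, one count per vocabulary word
rows = []
for text in texts:
    counts = Counter(text.split())  # a Counter returns 0 for missing words
    rows.append([counts[word] for word in vocabulary])

print(vocabulary)
for text, row in zip(texts, rows):
    print(f'{text!r}: {row}')
```

Each row is the bag-of-words vector of one document: word order is discarded and only the counts remain.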

In [3]:
# Import the necessary libraries
import pandas as pd                             # For handling data and creating DataFrames
from sklearn.feature_extraction.text import CountVectorizer  # For converting text to numerical features (BoW)

# Step 1: Define a small corpus (a list of short text documents)

In [10]:
texts = [
    'the red dog dog',
    'cat eats dog dog',
    'dog eats food',
    'red cat eats',
    'the hot dog'
]

# Step 2: Initialize the CountVectorizer

In [11]:
# This object will build a vocabulary and count the frequency of each word in the corpus
vectorizer = CountVectorizer()

# Step 3: Learn the vocabulary from the text data

In [12]:
# This step identifies all unique words and assigns an index to each
vectorizer.fit(texts)



# Step 4: Transform the original text data into a document-term matrix

In [14]:
# Each row represents a document, each column represents a word, and each cell shows word frequency
x = vectorizer.transform(texts) # create Document-Term Matrix (DTM) 
print(x.shape) #(Num_of_rows, Num_of_Cols)
print(x) # (document_index, word_index)	word_count

(5, 7)
  (0, 1)	2
  (0, 5)	1
  (0, 6)	1
  (1, 0)	1
  (1, 1)	2
  (1, 2)	1
  (2, 1)	1
  (2, 2)	1
  (2, 3)	1
  (3, 0)	1
  (3, 2)	1
  (3, 5)	1
  (4, 1)	1
  (4, 4)	1
  (4, 6)	1


# Step 5: Get the list of words (vocabulary) that CountVectorizer found

In [15]:
# (updated version: get_feature_names_out() instead of get_feature_names())
columns = vectorizer.get_feature_names_out()
print(columns)

['cat' 'dog' 'eats' 'food' 'hot' 'red' 'the']


# Step 6: Create a DataFrame to visualize the document-term matrix in a readable tabular format

In [16]:
# Rows = original texts, Columns = words from the vocabulary
pd.DataFrame(x.todense(), columns=columns, index=texts)

## Parameters:
# (1) x.todense() converts the sparse document-term matrix (DTM) to a dense matrix
#     Before: sparse format (only non-zero values are stored), first document:
#     (0, 1)  2
#     (0, 5)  1
#     (0, 6)  1
#     After: dense format, first document:
#     [[0 2 0 0 0 1 1]]
#
# (2) columns=columns --> label the table columns with the vocabulary words
#
# (3) index=texts     --> label the table rows with the original texts

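|  | cat | dog | eats | food | hot | red | the |
|---|---|---|---|---|---|---|---|
| the red dog dog | 0 | 2 | 0 | 0 | 0 | 1 | 1 |
| cat eats dog dog | 1 | 2 | 1 | 0 | 0 | 0 | 0 |
| dog eats food | 0 | 1 | 1 | 1 | 0 | 0 | 0 |
| red cat eats | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
| the hot dog | 0 | 1 | 0 | 0 | 1 | 0 | 1 |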


# Stop-words

Stop-words are words that are not significant to the topic at hand. For example, `[am, is, are, in, at, ...]` can be considered stop-words in many applications because they add little meaning.

In other domains and problems you may have different kinds of stop-words. For example, if you are processing chatbot data you may find phrases such as `[can you please, would you please, can I, may I, ...]` that add little meaning. Stop-words can therefore be domain specific, and `TF-IDF` can help you find them, since terms that appear in most documents receive a low weight.
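One possible way to handle domain-specific stop-words is to pass your own list to `CountVectorizer`. The sketch below merges scikit-learn's built-in `ENGLISH_STOP_WORDS` set with two custom words. The chat messages and the choice of `hello` and `thanks` as domain stop-words are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# Hypothetical support-chat messages, made up for illustration
chat_texts = [
    'hello can you please reset my password',
    'hello i would like to update my billing address',
    'thanks please cancel my subscription',
]

# Assumed domain-specific stop-words for this made-up corpus
domain_stop_words = {'hello', 'thanks'}

# stop_words accepts a list, so merge the built-in English set with the custom words
custom_stop_words = sorted(ENGLISH_STOP_WORDS.union(domain_stop_words))

vectorizer = CountVectorizer(stop_words=custom_stop_words)
vectorizer.fit(chat_texts)

# vocabulary_ maps each kept word to its column index
vocab = sorted(vectorizer.vocabulary_)
print(vocab)
```

The stop-word list is matched token by token, so multi-word phrases such as "can you please" would need to be removed in a preprocessing step instead.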

In [17]:
# Import the necessary libraries
import pandas as pd                                # For handling data and creating DataFrames
from sklearn.feature_extraction.text import CountVectorizer  # For converting text to numerical features (BoW)

texts = [ 'the red dog dog dog', 'cat eats dog', 'dog eats food',
         'red cat eats', 'the hot dog']
vectorizer = CountVectorizer(stop_words='english')
vectorizer.fit(texts)
x = vectorizer.transform(texts)
columns = vectorizer.get_feature_names_out()
pd.DataFrame(x.todense(), columns=columns, index=texts)

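|  | cat | dog | eats | food | hot | red |
|---|---|---|---|---|---|---|
| the red dog dog dog | 0 | 3 | 0 | 0 | 0 | 1 |
| cat eats dog | 1 | 1 | 1 | 0 | 0 | 0 |
| dog eats food | 0 | 1 | 1 | 1 | 0 | 0 |
| red cat eats | 1 | 0 | 1 | 0 | 0 | 1 |
| the hot dog | 0 | 1 | 0 | 0 | 1 | 0 |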


> Note that the word `the` was removed here

In [18]:
import pandas as pd

df = pd.read_csv('cleaned_dataset.csv') # create a DataFrame 

In [19]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer  # For converting text to numerical features (BoW)

def apply_BoW(texts):
    # Initialize the vectorizer
    vectorizer = CountVectorizer(stop_words='english')
    
    # Fit the vectorizer on the list of texts (corpus)
    vectorizer.fit(texts)
    
    # Transform the text into a document-term matrix
    x = vectorizer.transform(texts)
    
    # Get feature (word) names
    columns = vectorizer.get_feature_names_out()
    
    # Convert to a dense DataFrame for readability (memory-heavy on large corpora)
    table = pd.DataFrame(x.todense(), columns=columns, index=texts)
    
    # Print only a small portion of the result
    print("Displaying first 3 rows of the Bag of Words table:\n")
    print(table.head(3))  # Show first 3 rows only
    
    return table

In [20]:
# Apply the function on the column (convert to list)
apply_BoW(df['cleaned_tweet'].astype(str).tolist())

Displaying first 3 rows of the Bag of Words table:

                                                    aa  aaa  aaaaa  aaaaaand  \
father dysfunct selfish drag kid dysfunct            0    0      0         0   
thank credit cant use caus dont offer wheelchai...   0    0      0         0   
bihday majesti                                       0    0      0         0   

                                                    aaaaah  aaaaand  aaahh  \
father dysfunct selfish drag kid dysfunct                0        0      0   
thank credit cant use caus dont offer wheelchai...       0        0      0   
bihday majesti                                           0        0      0   

                                                    aaahhhh  aaahhhhh  aaand  \
father dysfunct selfish drag kid dysfunct                 0         0      0   
thank credit cant use caus dont offer wheelchai...        0         0      0   
bihday majesti                                            0         0      0   

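|  | aa | aaa | aaaaa | aaaaaand | aaaaah | aaaaand | aaahh | aaahhhh | aaahhhhh | aaand | ... | zucchini | zulu | zuma | zurich | zx | zydeco | zz | zzz | zzzzzz | zzzzzzzz |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| father dysfunct selfish drag kid dysfunct | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| thank credit cant use caus dont offer wheelchair van pdx | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| bihday majesti | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| love take time ur | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| factsguid societi | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| fish tomorrow carnt wait first time year | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ate isz youuu | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| see nina turner airwav tri wrap mantl genuin hero like shirley chisolm | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| listen sad song monday morn otw work sad | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |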


# N-Grams

N-grams are a way to account for context in the text. The wider the n-gram range, the more context you can capture, but also the more features you generate, so be careful not to run out of memory.
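To get a feel for how quickly the feature count grows, here is a small sketch that counts the distinct n-grams in a toy corpus for ranges 1..1, 1..2, and 1..3. It assumes whitespace tokenization and no stop-word removal, and `ngram_vocabulary` is a helper name invented for this example.

```python
texts = ['the red dog', 'cat eats dog', 'dog eats food', 'red cat eats', 'the hot dog']

def ngram_vocabulary(texts, max_n):
    """Collect every distinct n-gram of length 1..max_n across all texts."""
    vocab = set()
    for text in texts:
        tokens = text.split()
        for n in range(1, max_n + 1):
            # zip over n shifted copies of the token list yields consecutive n-token windows
            for gram in zip(*(tokens[i:] for i in range(n))):
                vocab.add(' '.join(gram))
    return vocab

sizes = {max_n: len(ngram_vocabulary(texts, max_n)) for max_n in (1, 2, 3)}
print(sizes)  # {1: 7, 2: 16, 3: 21}
```

Even on five three-word documents the vocabulary triples when going from unigrams to the 1..3 range. On a real corpus the growth is usually much steeper.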

# N-grams (Extended Step After Tokenization)

`Definition:` An N-gram is a sequence of N consecutive tokens created after tokenization.

`Goal:` Capture context and order between tokens.

`Example:`
If we take the tokenized list: ["Natural", "Language", "Processing", "is", "amazing"]

`Then:`
- `Unigram (1-gram):` ["Natural"], ["Language"], ["Processing"], ["is"], ["amazing"]
- `Bigram (2-gram):` ["Natural Language"], ["Language Processing"], ["Processing is"], ["is amazing"]
- `Trigram (3-gram):` ["Natural Language Processing"], ["Language Processing is"], ["Processing is amazing"]

In [1]:
from nltk import word_tokenize
from nltk.util import ngrams

text = "Natural Language Processing is amazing."
#Word Tokenization
tokens = word_tokenize(text)
print("Simple Tokens:", tokens)

# Generate Unigram (1-gram) Tokenization
unigram = list(ngrams(tokens, 1))
print("Unigram:", unigram)

# Generate bigrams (2-grams)
# ngrams() returns an iterator of tuples; list() collects them into a list
bigrams = list(ngrams(tokens, 2))
print("Bigrams:", bigrams)

# Generate Trigram (3-gram) Tokenization 
trigram = list(ngrams(tokens, 3))
print("Trigram:", trigram)

Simple Tokens: ['Natural', 'Language', 'Processing', 'is', 'amazing', '.']
Unigram: [('Natural',), ('Language',), ('Processing',), ('is',), ('amazing',), ('.',)]
Bigrams: [('Natural', 'Language'), ('Language', 'Processing'), ('Processing', 'is'), ('is', 'amazing'), ('amazing', '.')]
Trigram: [('Natural', 'Language', 'Processing'), ('Language', 'Processing', 'is'), ('Processing', 'is', 'amazing'), ('is', 'amazing', '.')]


In [22]:
texts = [ 'the red dog', 'cat eats dog', 'dog eats food', 'red cat eats', 'the hot dog']

vectorizer = CountVectorizer(stop_words='english', ngram_range=(1, 2))
vectorizer.fit(texts)
x = vectorizer.transform(texts)

columns = vectorizer.get_feature_names_out()
print("Columns", columns)
print("-"*80)
df = pd.DataFrame(x.todense(), columns=columns, index=texts)
print(df)

Columns ['cat' 'cat eats' 'dog' 'dog eats' 'eats' 'eats dog' 'eats food' 'food'
 'hot' 'hot dog' 'red' 'red cat' 'red dog']
--------------------------------------------------------------------------------
               cat  cat eats  dog  dog eats  eats  eats dog  eats food  food  \
the red dog      0         0    1         0     0         0          0     0   
cat eats dog     1         1    1         0     1         1          0     0   
dog eats food    0         0    1         1     1         0          1     1   
red cat eats     1         1    0         0     1         0          0     0   
the hot dog      0         0    1         0     0         0          0     0   

               hot  hot dog  red  red cat  red dog  
the red dog      0        0    1        0        1  
cat eats dog     0        0    0        0        0  
dog eats food    0        0    0        0        0  
red cat eats     0        0    1        1        0  
the hot dog      1        1    0        0        0  

<h1 style='text-align:center'>  TF-IDF </h1>

# TF = Term Frequency
- It measures `how many times` a word `appears` in a `document`.
- It represents how important the word is inside the same document.

`Formula:` $TF = \frac{\text{number of times the word appears in the document}}{\text{total number of words in the document}}$

# IDF = Inverse Document Frequency
- It measures how important a word is across the whole corpus, not just one document.
    - If a word appears in `many documents` → it is `less important` (like: the, and, is).
    - If a word appears in `few documents` → it is `more important`.

`Formula:` $IDF = \log\left(\frac{N}{df}\right)$

`Where:`
- N = total number of documents
- df = number of documents containing the word

# TF-IDF
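- Instead of just counting the frequency of each word, each word here is weighted using TF-IDF:

$$W_{x, y} = tf_{x, y} \times \log\left(\frac{N}{df_x}\right)$$

where $tf_{x, y}$ is the frequency of term $x$ in document $y$, $N$ is the total number of documents, and $df_x$ is the number of documents containing term $x$.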


`TF-IDF` combines both `TF` and `IDF` to measure:
- How important a word is in one document, compared to
- How rare it is across all documents

`Formula:` $TF\text{-}IDF = TF \times IDF$
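The formulas above can be sketched directly in plain Python. This is a minimal illustration that assumes whitespace tokenization, no stop-word removal, and the natural logarithm (the formulas do not fix a base). The helper names `tf`, `idf`, and `tf_idf` are chosen for this example only.

```python
import math

texts = ['the red dog dog dog', 'cat eats dog', 'dog eats food', 'red cat eats', 'the hot dog']
docs = [text.split() for text in texts]
N = len(docs)  # total number of documents

def tf(word, doc):
    # Number of times the word appears / total words in the document
    return doc.count(word) / len(doc)

def idf(word):
    # df = number of documents containing the word (must be > 0 here)
    df = sum(1 for doc in docs if word in doc)
    return math.log(N / df)

def tf_idf(word, doc):
    return tf(word, doc) * idf(word)

first_doc = docs[0]  # 'the red dog dog dog'
for word in ('dog', 'red'):
    print(word, round(tf(word, first_doc), 3), round(idf(word), 3), round(tf_idf(word, first_doc), 3))
```

In the first document `dog` occurs three times and `red` only once, yet `red` ends up with the higher weight, because `dog` appears in four of the five documents and its IDF is therefore small.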

# Summary
- `TF:` Word importance inside one document
- `IDF:` Word importance across all documents
- `TF-IDF:` Combined weight showing unique importance

In [23]:
from sklearn.feature_extraction.text import TfidfVectorizer
texts = [ 'the red dog dog dog', 'cat eats dog', 'dog eats food','red cat eats', 'the hot dog']
vectorizer = TfidfVectorizer(stop_words='english')
vectorizer.fit(texts)
x = vectorizer.transform(texts)
columns = vectorizer.get_feature_names_out()
pd.DataFrame(x.todense(), columns=columns, index=texts)

|  | cat | dog | eats | food | hot | red |
|---|---|---|---|---|---|---|
| the red dog dog dog | 0.0 | 0.902454 | 0.0 | 0.0 | 0.0 | 0.430787 |
| cat eats dog | 0.677803 | 0.473309 | 0.562638 | 0.0 | 0.0 | 0.0 |
| dog eats food | 0.0 | 0.423954 | 0.503968 | 0.752515 | 0.0 | 0.0 |
| red cat eats | 0.609818 | 0.0 | 0.506204 | 0.0 | 0.0 | 0.609818 |
| the hot dog | 0.0 | 0.490845 | 0.0 | 0.0 | 0.871247 | 0.0 |


We can already build some applications using only these. Let's try a very quick one.

In [25]:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
# Example corpus
texts = ["the cat eats food",
         "the dog eats food"]

# Initialize TF-IDF vectorizer
vectorizer = TfidfVectorizer()

# Fit and transform the corpus
X = vectorizer.fit_transform(texts)

# Convert to DataFrame for better view
pd.DataFrame(X.toarray(),  # or X.todense()
             columns=vectorizer.get_feature_names_out(),
             index=texts)

|  | cat | dog | eats | food | the |
|---|---|---|---|---|---|
| the cat eats food | 0.630099 | 0.0 | 0.448321 | 0.448321 | 0.448321 |
| the dog eats food | 0.0 | 0.630099 | 0.448321 | 0.448321 | 0.448321 |
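The values in this table do not come from the plain $TF \times IDF$ formula. With its default settings, `TfidfVectorizer` uses the raw term count as TF, a smoothed IDF of $\ln\left(\frac{1 + N}{1 + df}\right) + 1$, and then scales each row to unit L2 length. The sketch below reproduces the table by hand under those assumptions, using whitespace tokenization, which is enough for this two-sentence corpus. The helper `smoothed_idf` is named for this example only.

```python
import math

texts = ['the cat eats food', 'the dog eats food']
docs = [text.split() for text in texts]
vocab = sorted({word for doc in docs for word in doc})
n_docs = len(docs)

def smoothed_idf(word):
    # Smoothed IDF: behaves as if one extra document contained every term
    df = sum(1 for doc in docs if word in doc)
    return math.log((1 + n_docs) / (1 + df)) + 1

rows = []
for doc in docs:
    # Raw count times smoothed IDF for every vocabulary word
    raw = [doc.count(word) * smoothed_idf(word) for word in vocab]
    # L2-normalize the row so its Euclidean length is 1
    norm = math.sqrt(sum(value * value for value in raw))
    rows.append([round(value / norm, 6) for value in raw])

print(vocab)
for text, row in zip(texts, rows):
    print(f'{text!r}: {row}')
```

Because of the `+ 1` in the smoothed IDF, words that occur in every document (`the`, `eats`, `food`) keep a non-zero weight instead of dropping to zero as they would under the plain formula.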
