# Feature Extraction

To feed text to a model, we first need to transform it into numerical features. In this notebook we discuss how to build a bag-of-words model from text so that it can be used later in different applications.

# Bag of words

A bag-of-words model represents each document by counting how many times each word of the corpus vocabulary occurs in it.
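
To make the idea concrete before turning to scikit-learn, here is a minimal pure-Python sketch of the same counting. It assumes plain whitespace tokenization, and `bag_of_words` is a hypothetical helper written for this illustration; real vectorizers also deal with punctuation and token filtering.

```python
from collections import Counter

def bag_of_words(docs):
    """Return (vocabulary, rows) where rows[i][j] counts word j in document i."""
    # Tokenize naively on whitespace (an assumption made for this sketch)
    tokenized = [doc.lower().split() for doc in docs]
    # The vocabulary is the sorted set of all distinct tokens
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Counter returns 0 for words that do not occur in this document
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows

vocabulary, rows = bag_of_words(['the red dog', 'the hot dog dog'])
print(vocabulary)  # ['dog', 'hot', 'red', 'the']
print(rows)        # [[1, 0, 1, 1], [2, 1, 0, 1]]
```

Each row has one entry per vocabulary word, which is the layout of the document-term matrix.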

In [1]:
# Import the necessary libraries
import pandas as pd                                # For handling data and creating DataFrames
from sklearn.feature_extraction.text import CountVectorizer  # For converting text to numerical features (BoW)

# Step 1: Define a small corpus (a list of short text documents)

In [2]:
texts = [
    'the red dog',
    'cat eats dog',
    'dog eats food',
    'red cat eats',
    'the hot dog'
]

# Step 2: Initialize the CountVectorizer

In [3]:
# This object will build a vocabulary and count the frequency of each word in the corpus
vectorizer = CountVectorizer()

# Step 3: Learn the vocabulary from the text data

In [5]:
# This step identifies all unique words and assigns an index to each
vectorizer.fit(texts)  # fit() returns the fitted vectorizer itself, so no assignment is needed

# Step 4: Transform the original text data into a document-term matrix

In [6]:
# Each row represents a document, each column represents a word, and each cell shows word frequency
x = vectorizer.transform(texts)  # create the Document-Term Matrix (DTM) as a sparse matrix
print(x.shape)  # (number_of_documents, vocabulary_size)
print(x)        # each line: (document_index, word_index)  word_count

(5, 7)
  (0, 1)	1
  (0, 5)	1
  (0, 6)	1
  (1, 0)	1
  (1, 1)	1
  (1, 2)	1
  (2, 1)	1
  (2, 2)	1
  (2, 3)	1
  (3, 0)	1
  (3, 2)	1
  (3, 5)	1
  (4, 1)	1
  (4, 4)	1
  (4, 6)	1
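
The printed output lists only the non-zero cells, each as `(document_index, word_index)` followed by the count. As a rough illustration of how such entries map onto dense rows, the sketch below expands a few triples copied by hand from the output above. `to_dense_rows` is a hypothetical helper, not part of scikit-learn or SciPy.

```python
def to_dense_rows(triples, shape):
    """Expand (row, col, value) triples into a list of dense rows."""
    n_rows, n_cols = shape
    # Start from an all-zero table, then fill in the stored cells
    dense = [[0] * n_cols for _ in range(n_rows)]
    for row, col, value in triples:
        dense[row][col] = value
    return dense

# Non-zero entries for the first two documents, copied from the output above
triples = [(0, 1, 1), (0, 5, 1), (0, 6, 1),
           (1, 0, 1), (1, 1, 1), (1, 2, 1)]
dense = to_dense_rows(triples, shape=(2, 7))
print(dense[0])  # [0, 1, 0, 0, 0, 1, 1]
print(dense[1])  # [1, 1, 1, 0, 0, 0, 0]
```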


# Step 5: Get the list of words (vocabulary) that CountVectorizer found

In [7]:
# get_feature_names_out() replaces the older get_feature_names(), which was removed in scikit-learn 1.2
columns = vectorizer.get_feature_names_out()
print(columns)

['cat' 'dog' 'eats' 'food' 'hot' 'red' 'the']


# Step 6: Create a DataFrame to visualize the document-term matrix in a readable tabular format

In [8]:
# Rows = original texts, Columns = words from the vocabulary
pd.DataFrame(x.todense(), columns=columns, index=texts)

# Parameters:
# (1) x.todense() converts the sparse Document-Term Matrix (DTM) into a dense matrix.
#     Before (sparse format, only non-zero values are stored):
#       (0, 1)  1
#       (0, 5)  1
#       (0, 6)  1
#     After (dense format, first row):
#       [[0 1 0 0 0 1 1]]
# (2) columns=columns labels the table columns with the vocabulary words.
# (3) index=texts labels the table rows with the original documents.

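| | cat | dog | eats | food | hot | red | the |
|---|---|---|---|---|---|---|---|
| the red dog | 0 | 1 | 0 | 0 | 0 | 1 | 1 |
| cat eats dog | 1 | 1 | 1 | 0 | 0 | 0 | 0 |
| dog eats food | 0 | 1 | 1 | 1 | 0 | 0 | 0 |
| red cat eats | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
| the hot dog | 0 | 1 | 0 | 0 | 1 | 0 | 1 |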


# Stop-words

Stop-words are words that are not significant to the topic at hand. For example, `[am, is, are, in, at, ...]` are treated as stop-words in many applications because they add little meaning.
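
As a minimal sketch of the idea, the snippet below drops stop-words from a sentence. The stop-word set is only the handful of words listed above, and `remove_stop_words` is a hypothetical helper that assumes whitespace tokenization; practical lists are much longer.

```python
STOP_WORDS = {'am', 'is', 'are', 'in', 'at'}  # illustrative subset only

def remove_stop_words(text, stop_words=STOP_WORDS):
    """Lowercase, split on whitespace, and drop tokens found in stop_words."""
    return [token for token in text.lower().split() if token not in stop_words]

print(remove_stop_words('Dogs are happy at home'))
# ['dogs', 'happy', 'home']
```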

In other domains and problems you may encounter different kinds of stop-words. For example, when processing chatbot data you may find phrases such as `[can you please, would you please, can I, may I, ...]` that add little meaning. Stop-words can therefore be domain-specific, and `TF-IDF` can help you find them.
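
One rough way to surface domain-specific candidates is to look at document frequency, the quantity behind the IDF part of TF-IDF: a word that appears in nearly every document gets an IDF close to zero. The sketch below assumes whitespace tokenization, the unsmoothed formula `log(N / df)`, and made-up chatbot messages. `document_frequency`, `frequent_words`, and the `threshold` parameter are hypothetical names chosen for this illustration.

```python
import math
from collections import Counter

def document_frequency(docs):
    """Count in how many documents each word appears."""
    df = Counter()
    for doc in docs:
        # Use a set so a word is counted once per document
        df.update(set(doc.lower().split()))
    return df

def frequent_words(docs, threshold=0.9):
    """Return words whose share of documents is at least `threshold`."""
    df = document_frequency(docs)
    n_docs = len(docs)
    return sorted(word for word, count in df.items() if count / n_docs >= threshold)

# Made-up messages, for illustration only
messages = [
    'can you please reset my password',
    'can you please cancel my order',
    'can you please track my package',
]
df = document_frequency(messages)
print(frequent_words(messages))                         # ['can', 'my', 'please', 'you']
print(math.log(len(messages) / df['can']))              # 0.0
print(round(math.log(len(messages) / df['order']), 3))  # 1.099
```

Words returned by `frequent_words` are only candidates; whether they are safe to drop depends on the task.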

In [9]:
# Import the necessary libraries
import pandas as pd                                # For handling data and creating DataFrames
from sklearn.feature_extraction.text import CountVectorizer  # For converting text to numerical features (BoW)

texts = [ 'the red dog dog dog', 'cat eats dog', 'dog eats food',
         'red cat eats', 'the hot dog']
vectorizer = CountVectorizer(stop_words='english')
vectorizer.fit(texts)
x = vectorizer.transform(texts)
columns = vectorizer.get_feature_names_out()
pd.DataFrame(x.todense(), columns=columns, index=texts)

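| | cat | dog | eats | food | hot | red |
|---|---|---|---|---|---|---|
| the red dog dog dog | 0 | 3 | 0 | 0 | 0 | 1 |
| cat eats dog | 1 | 1 | 1 | 0 | 0 | 0 |
| dog eats food | 0 | 1 | 1 | 1 | 0 | 0 |
| red cat eats | 1 | 0 | 1 | 0 | 0 | 1 |
| the hot dog | 0 | 1 | 0 | 0 | 1 | 0 |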


> Note that the word `the` was removed here

In [10]:
import pandas as pd

df = pd.read_csv('cleaned_dataset.csv')  # load the cleaned dataset into a DataFrame

In [11]:
import nltk
from nltk.tokenize import word_tokenize
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer  # For converting text to numerical features (BoW)

def apply_BoW(texts):
    # Initialize the vectorizer
    vectorizer = CountVectorizer(stop_words='english')
    
    # Fit the vectorizer on the list of texts (corpus)
    vectorizer.fit(texts)
    
    # Transform the text into a document-term matrix
    x = vectorizer.transform(texts)
    
    # Get feature (word) names
    columns = vectorizer.get_feature_names_out()
    
    # Convert to DataFrame for readability
    table = pd.DataFrame(x.todense(), columns=columns, index=texts)
    
    # Print only a small portion of the result
    print("Displaying first 3 rows of the Bag of Words table:\n")
    print(table.head(3))  # Show first 3 rows only
    
    return table

In [12]:
# Apply the function on the column (convert to list)
apply_BoW(df['cleaned_tweet'].astype(str).tolist())

Displaying first 3 rows of the Bag of Words table:

                                                    aa  aaa  aaaaa  aaaaaand  \
father dysfunct selfish drag kid dysfunct            0    0      0         0   
thank credit cant use caus dont offer wheelchai...   0    0      0         0   
bihday majesti                                       0    0      0         0   

                                                    aaaaah  aaaaand  aaahh  \
father dysfunct selfish drag kid dysfunct                0        0      0   
thank credit cant use caus dont offer wheelchai...       0        0      0   
bihday majesti                                           0        0      0   

                                                    aaahhhh  aaahhhhh  aaand  \
father dysfunct selfish drag kid dysfunct                 0         0      0   
thank credit cant use caus dont offer wheelchai...        0         0      0   
bihday majesti                                            0         0     

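| | aa | aaa | aaaaa | aaaaaand | aaaaah | aaaaand | aaahh | aaahhhh | aaahhhhh | aaand | ... | zucchini | zulu | zuma | zurich | zx | zydeco | zz | zzz | zzzzzz | zzzzzzzz |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| father dysfunct selfish drag kid dysfunct | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| thank credit cant use caus dont offer wheelchair van pdx | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| bihday majesti | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| love take time ur | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| factsguid societi | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| fish tomorrow carnt wait first time year | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ate isz youuu | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| see nina turner airwav tri wrap mantl genuin hero like shirley chisolm | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| listen sad song monday morn otw work sad | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |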
