This notebook applies text transformation (vectorization) using several approaches: bag of words, TF-IDF, and converting text into sequences of numbers. Vectorization is a common and essential step in natural language processing (NLP), because machine learning algorithms operate on numerical input rather than raw text.

In [16]:
# Import libraries
import pandas as pd  # For working with structured data in a tabular form
import numpy as np  # Essential for numerical operations and working with arrays
import warnings  # Used to control warnings (filtering them in this case)
import re  # Regular expressions for text pattern matching and manipulation
from collections import Counter  # Useful for counting occurrences of elements in a collection
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer  # For transforming text data into numerical features
from gensim.models import Word2Vec  # Used for word embedding, capturing semantic relationships between words
from gensim.models.phrases import Phrases, Phraser  # For detecting common phrases (bigrams) in a corpus
from gensim.utils import simple_preprocess  # Utility for tokenizing text into words

In [61]:
# List of preprocessed tokens for each document
preprocessed_tokens = [
                        ['include', 'frequent', 'business', 'primarily', 'change', 'strategic', 'metric', 'operation', 'marketing', 'tool', 'check', 'base', 'input', 'calculate', 'deck', 'growth', 'fast', 'center', 'connect', 'overview', 'leadership', 'mapping', 'communication', 'driver', 'structure', 'pbna', 'source', 'pl', 'transfer', 'around', 'help', 'leader', 'facility', 'leverage', 'requestor', 'flawless', 'insight', 'develop', 'explain', 'partner', 'act', 'future', 'delivery', 'flag', 'primary', 'adherence', 'able', 'explanation', 'key', 'strong', 'agree', 'endusers', 'responsible', 'optimization', 'relate', 'interval', 'within', 'performance', 'work', 'resource', 'stakeholder', 'painpoint', 'monitor', 'scope', 'datum', 'team', 'execute', 'executor', 'enhance', 'pepsico', 'management', 'kpi', 'emerge', 'template', 'upon', 'timeline', 'feedback', 'element', 'operational', 'process', 'quality', 'role', 'sparkle', 'multiple', 'exist', 'opportunity', 'bottleneck', 'level', 'brand', 'customer', 'service', 'curate', 'project', 'coe', 'responsibility', 'loop', 'require', 'line', 'internal', 'sectorfunctional', 'deliver', 'output', 'ongoing', 'competitor', 'incl', 'qualification', 'critical', 'beyond', 'presentation', 'improve', 'channel', 'diagnostic', 'enduser', 'effort', 'support', 'consideration', 'market', 'utilize', 'portfolio', 'dashboard', 'ssc', 'risk', 'workflow', 'world', 'content', 'measure', 'report', 'analyst', 'vertical', 'fuel', 'hisher', 'automate', 'knowledge'],
                        ['charter', 'reporting', 'deliverable', 'need', 'finetune', 'player', 'outcome', 'align', 'focus', 'regular', 'incorporate', 'recruitment', 'plan']

                      ]

# Create a corpus: a collection of text (list of strings)
corpus = [' '.join(tokens) for tokens in preprocessed_tokens]

# Print the corpus
print("Corpus:")
print(corpus)

Corpus:
['include frequent business primarily change strategic metric operation marketing tool check base input calculate deck growth fast center connect overview leadership mapping communication driver structure pbna source pl transfer around help leader facility leverage requestor flawless insight develop explain partner act future delivery flag primary adherence able explanation key strong agree endusers responsible optimization relate interval within performance work resource stakeholder painpoint monitor scope datum team execute executor enhance pepsico management kpi emerge template upon timeline feedback element operational process quality role sparkle multiple exist opportunity bottleneck level brand customer service curate project coe responsibility loop require line internal sectorfunctional deliver output ongoing competitor incl qualification critical beyond presentation improve channel diagnostic enduser effort support consideration market utilize portfolio dashboard ssc ri

#### Bag of Words (BoW)

The Bag of Words (BoW) model is a text representation technique that records how often each word occurs in a document while discarding word order and sentence structure. Each document is treated as an unordered multiset (a "bag") of words and becomes a numerical vector with one element per vocabulary word, holding that word's count. BoW is a simple and often effective way to convert text into a format suitable for machine learning algorithms, but it carries no semantic information: it ignores context and treats related words (for example synonyms) as unrelated dimensions.
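
To make the counting concrete, here is a minimal sketch of a bag-of-words transformation that uses only `collections.Counter`. The toy documents and the helper name `bag_of_words` are illustrative assumptions, not part of the notebook's data; `CountVectorizer` does the same job with tokenization and stop-word handling built in.

```python
from collections import Counter

def bag_of_words(token_lists):
    """Return (vocabulary, count_matrix) for a list of tokenized documents."""
    # Vocabulary: every distinct token across the corpus, sorted for a stable column order
    vocabulary = sorted({token for tokens in token_lists for token in tokens})
    matrix = []
    for tokens in token_lists:
        counts = Counter(tokens)  # token -> number of occurrences in this document
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix

# Hypothetical toy documents (already tokenized)
toy_docs = [["team", "plan", "team"], ["plan", "report"]]
vocabulary, matrix = bag_of_words(toy_docs)
print(vocabulary)  # ['plan', 'report', 'team']
print(matrix)      # [[1, 0, 2], [1, 1, 0]]
```

Each row is one document and each column one vocabulary word, which is the same layout as the DataFrame built from `CountVectorizer`.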


In [62]:
# Use CountVectorizer from scikit-learn to build the bag-of-words count matrix

CountVec = CountVectorizer(ngram_range=(1,1),  # unigrams only; use ngram_range=(2,2) for bigrams only
                           stop_words='english')  # drop common English stop words

# Transform
count_data = CountVec.fit_transform(corpus)

# Create dataframe
vectorized_bow = count_data.toarray()
df_bow = pd.DataFrame(vectorized_bow, columns = CountVec.get_feature_names_out())

In [63]:
df_bow.head()

,able,act,adherence,agree,align,analyst,automate,base,bottleneck,brand,...,team,template,timeline,tool,transfer,utilize,vertical,work,workflow,world
0,1,1,1,1,0,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1
1,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


#### Tf-idf

Next, we explore TF-IDF (term frequency-inverse document frequency), which goes beyond raw word counts. TF-IDF weights each term's frequency in a document by how rare that term is across the whole corpus: terms that occur in many documents are down-weighted, while terms concentrated in a few documents are up-weighted. The resulting vectors therefore emphasize terms that are distinctive for a document rather than merely frequent.
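
The weighting can be reproduced by hand. The sketch below assumes scikit-learn's default conventions for `TfidfVectorizer` (raw counts as tf, unsmoothed idf of `ln(n / df) + 1` when `smooth_idf=False`, and L2 normalization of each row) and checks the manual result against the library on a hypothetical two-document corpus.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus; "plan" occurs in both documents
toy_corpus = ["team plan team", "plan report"]

# Raw term counts (tf), shape (n_documents, n_terms)
tf = CountVectorizer().fit_transform(toy_corpus).toarray().astype(float)

n_docs = tf.shape[0]
df = (tf > 0).sum(axis=0)        # number of documents containing each term
idf = np.log(n_docs / df) + 1.0  # unsmoothed idf, as used with smooth_idf=False

weights = tf * idf
manual = weights / np.linalg.norm(weights, axis=1, keepdims=True)  # L2-normalize each row

reference = TfidfVectorizer(smooth_idf=False).fit_transform(toy_corpus).toarray()
print(np.allclose(manual, reference))  # True
```

Here "plan" gets the lowest idf because it appears in every document, so it contributes less to each row than "team" or "report".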

In [64]:
# Without smooth IDF: idf(t) = ln(n / df(t)) + 1

# Define tf-idf
tf_idf_vec = TfidfVectorizer(use_idf=True,
                             smooth_idf=False,
                             ngram_range=(1,1),  # unigrams only; use ngram_range=(2,2) for bigrams only
                             stop_words='english')
# Transform
tf_idf_data = tf_idf_vec.fit_transform(corpus)

# Create dataframe
vectorized_tf_idf = tf_idf_data.toarray()
df_tf_idf = pd.DataFrame(vectorized_tf_idf,columns=tf_idf_vec.get_feature_names_out())

In [65]:
df_tf_idf.head()

,able,act,adherence,agree,align,analyst,automate,base,bottleneck,brand,...,team,template,timeline,tool,transfer,utilize,vertical,work,workflow,world
0,0.088045,0.088045,0.088045,0.088045,0.0,0.088045,0.088045,0.088045,0.088045,0.088045,...,0.088045,0.088045,0.088045,0.088045,0.088045,0.088045,0.088045,0.088045,0.088045,0.088045
1,0.0,0.0,0.0,0.0,0.27735,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [66]:
# With smooth IDF: idf(t) = ln((1 + n) / (1 + df(t))) + 1, as if one extra document contained every term
tf_idf_vec_smooth = TfidfVectorizer(use_idf=True,
                                    smooth_idf=True,
                                    ngram_range=(1,1),stop_words='english')

# Transform
tf_idf_data_smooth = tf_idf_vec_smooth.fit_transform(corpus)

# Create dataframe
vectorized_tf_idf_smooth = tf_idf_data_smooth.toarray()
df_tf_idf_smooth = pd.DataFrame(vectorized_tf_idf_smooth,columns=tf_idf_vec_smooth.get_feature_names_out())

In [67]:
df_tf_idf_smooth.head()

,able,act,adherence,agree,align,analyst,automate,base,bottleneck,brand,...,team,template,timeline,tool,transfer,utilize,vertical,work,workflow,world
0,0.088045,0.088045,0.088045,0.088045,0.0,0.088045,0.088045,0.088045,0.088045,0.088045,...,0.088045,0.088045,0.088045,0.088045,0.088045,0.088045,0.088045,0.088045,0.088045,0.088045
1,0.0,0.0,0.0,0.0,0.27735,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
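
The smoothed and unsmoothed tables show the same values. This appears to be because no term occurs in both documents: every term then has the same document frequency and therefore the same idf under either formula, and the L2 normalization of each row removes that common factor. The second document has 13 terms, each occurring once, which gives 1/sqrt(13) ≈ 0.27735 for every nonzero entry. The sketch below illustrates the effect on two hypothetical corpora (the helper `tfidf` and both corpora are assumptions made for this example).

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def tfidf(corpus, smooth):
    """Dense TF-IDF matrix with scikit-learn defaults and the given smooth_idf setting."""
    return TfidfVectorizer(smooth_idf=smooth).fit_transform(corpus).toarray()

# Hypothetical corpora: one with disjoint vocabularies, one with a shared term ("plan")
disjoint = ["charter reporting plan", "team budget"]
overlapping = ["charter reporting plan", "team plan"]

same_when_disjoint = np.allclose(tfidf(disjoint, False), tfidf(disjoint, True))
same_when_overlapping = np.allclose(tfidf(overlapping, False), tfidf(overlapping, True))

print(same_when_disjoint)     # True: uniform idf cancels under L2 normalization
print(same_when_overlapping)  # False: "plan" now has a different idf from the other terms
print(round(13 ** -0.5, 5))   # 0.27735
```

Smoothing only changes the output once document frequencies differ between terms, so a larger corpus with overlapping vocabulary is needed to see its effect.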


#### Word embedding

Word embeddings represent words as dense, continuous vectors, where geometric relationships between vectors (for example, cosine similarity) reflect semantic similarity between words. Unlike Bag-of-Words or TF-IDF, which treat every word as an independent dimension, embeddings are learned from the contexts in which words appear, so words used in similar contexts end up with similar vectors. Popular algorithms for generating embeddings include Word2Vec, GloVe, and fastText, and the resulting vectors have proven valuable in natural language processing tasks such as sentiment analysis, machine translation, and document clustering.
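
Similarity between two embeddings is commonly measured with cosine similarity. The sketch below uses invented 3-dimensional vectors purely for illustration (real Word2Vec vectors in this notebook have 100 dimensions); the function name `cosine_similarity` and the example vectors are assumptions, not values from a trained model.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical 3-dimensional embeddings, invented for illustration only
vec_report = [1.0, 2.0, 0.0]
vec_dashboard = [2.0, 4.0, 0.0]  # same direction as vec_report
vec_fuel = [0.0, 0.0, 3.0]       # orthogonal to vec_report

print(cosine_similarity(vec_report, vec_dashboard))  # close to 1
print(cosine_similarity(vec_report, vec_fuel))       # 0.0
```

For a trained gensim model, `model.wv.similarity(word_a, word_b)` returns the same quantity for two words in the vocabulary.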

In [79]:
# Train Word2Vec model
word2vec_model = Word2Vec(sentences=preprocessed_tokens, vector_size=100, window=5, min_count=1, workers=4)

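def get_embedding(sentence, model):
    # Look up one vector per token; the result has shape (n_tokens, vector_size),
    # so this is a matrix of word vectors rather than a single sentence vector
    try:
        return model.wv[sentence]
    except KeyError:
        # If any token is missing from the model's vocabulary, fall back to a
        # zero matrix with the same shape as a successful lookup
        return np.zeros((len(sentence), model.vector_size), dtype=np.float32)

# Look up the token vectors for every tokenized document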
embeddings = [get_embedding(sentence, word2vec_model) for sentence in preprocessed_tokens]
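
Each entry of `embeddings` is a matrix with one row per token, so documents of different lengths produce matrices of different shapes. Many machine learning algorithms expect one fixed-length vector per document; averaging the token vectors (mean pooling) is one simple and common way to get that. The sketch below is an illustration under that assumption, with a small invented matrix standing in for one entry of `embeddings`; the helper name `mean_pool` is hypothetical.

```python
import numpy as np

def mean_pool(token_vectors):
    """Average a (n_tokens, vector_size) matrix into one (vector_size,) document vector."""
    token_vectors = np.asarray(token_vectors, dtype=np.float32)
    return token_vectors.mean(axis=0)

# Stand-in for one entry of `embeddings`: 3 tokens, 4 dimensions (invented values)
toy_token_vectors = np.array([[1.0, 0.0, 2.0, 4.0],
                              [3.0, 0.0, 2.0, 0.0],
                              [2.0, 3.0, 2.0, 2.0]])

doc_vector = mean_pool(toy_token_vectors)
print(doc_vector.shape)  # (4,)
print(doc_vector)        # [2. 1. 2. 2.]
```

Applied to this notebook, `np.vstack([mean_pool(e) for e in embeddings])` would give one 100-dimensional row per document. Mean pooling discards word order, and with a model trained on only two documents the vectors themselves carry little semantic signal, so this is best read as a demonstration of the mechanics.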

In [80]:
embeddings

[array([[-0.00613229,  0.00818415, -0.00648556, ..., -0.00633476,
         -0.00680626, -0.0078928 ],
        [-0.00333522, -0.00689669,  0.00649364, ...,  0.00372715,
          0.00802185, -0.00251265],
        [ 0.00698195, -0.0002085 , -0.00794451, ...,  0.00725142,
         -0.0037523 , -0.00743677],
        ...,
        [ 0.00795717, -0.00677211,  0.00031158, ...,  0.00920589,
          0.00928318,  0.0031113 ],
        [ 0.00972337, -0.00112904, -0.0070228 , ...,  0.00348369,
         -0.00564961,  0.00584573],
        [-0.00156977,  0.0022512 ,  0.00540881, ..., -0.00701191,
          0.00612427,  0.0072808 ]], dtype=float32),
 array([[ 2.4947638e-03,  5.9994394e-03, -9.6815564e-03, ...,
          1.1123770e-03, -2.4167644e-03, -3.4578741e-03],
        [ 4.3556388e-03,  6.8813427e-03,  9.0298365e-04, ...,
          8.8008018e-03,  1.7078171e-03, -3.3627371e-03],
        [ 6.1708512e-03, -2.0284872e-03,  6.2553375e-03, ...,
         -1.2160570e-03,  9.8541807e-03,  3.6791752e-03]