
Vectorization of Textual Data in NLP
* Vectorization is the process of converting text into numerical vectors so that ML models can process and analyze it. Because algorithms operate on numbers rather than raw text, vectorization bridges the gap by representing words, sentences, or documents as numerical features.

* Popular vectorization methods: BoW, TF-IDF, and N-grams.

Word Embeddings
* Word embeddings are dense, low-dimensional vector representations of words that capture semantic and syntactic relationships.

Bag of Words, or BoW (a vectorization technique)
* The Bag of Words (BoW) model is one of the simplest and most traditional methods for text vectorization. It converts text into numerical vectors by counting word frequencies, ignoring grammar, word order, and context.

How does BoW work?
* Tokenization: split the text into individual words (tokens).
* Vocabulary creation: build a dictionary of the unique words in the entire corpus.
* Vectorization: represent each document as a vector in which each dimension corresponds to one vocabulary word and the value is that word's frequency in the document.
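The three steps above can be sketched in plain Python. This is a minimal illustration, not how CountVectorizer is implemented: the toy `corpus`, the whitespace tokenizer, and the helper `to_vector` are assumptions made for the example.

```python
from collections import Counter

# Toy corpus (hypothetical, for illustration only)
corpus = ["the cat sat", "the dog sat on the cat"]

# 1. Tokenization: lowercase each document and split on whitespace
tokenized = [doc.lower().split() for doc in corpus]

# 2. Vocabulary creation: the sorted set of unique tokens in the corpus
vocabulary = sorted({token for doc in tokenized for token in doc})

# 3. Vectorization: one count per vocabulary word, in vocabulary order
def to_vector(tokens):
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

vectors = [to_vector(tokens) for tokens in tokenized]

print(vocabulary)  # ['cat', 'dog', 'on', 'sat', 'the']
print(vectors)     # [[1, 0, 0, 1, 1], [1, 1, 1, 1, 2]]
```

Every vector has the same length as the vocabulary, which is why the matrix grows wider as the corpus gains new words.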

What are the limitations?
* No semantic meaning: words are treated independently, with no understanding of context.
* High dimensionality: each vector has one dimension per vocabulary word, so a large vocabulary produces long, sparse vectors (mostly zeros).
* Ignores word order: "Cat bites dog" and "Dog bites cat" have the same BoW representation.
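The word-order limitation can be checked directly. The sketch below uses hypothetical helpers (`unigram_bag`, `bigram_bag`) built on `collections.Counter`. It compares the two sentences as bags of single words, and then as bags of adjacent word pairs, which is the idea behind the N-grams method mentioned earlier.

```python
from collections import Counter

def unigram_bag(text):
    # Count single words, ignoring case and position
    return Counter(text.lower().split())

def bigram_bag(text):
    # Count pairs of adjacent words, which preserves some local order
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))

a, b = "Cat bites dog", "Dog bites cat"

same_unigrams = unigram_bag(a) == unigram_bag(b)
same_bigrams = bigram_bag(a) == bigram_bag(b)

print(same_unigrams, same_bigrams)  # True False
```

With single words the two sentences are indistinguishable. With word pairs they differ, at the cost of a larger vocabulary.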

In [32]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

In [2]:
df = pd.read_csv('/content/spam.csv')

In [3]:
df.head()

,Category,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [4]:
df.shape

(5572, 2)

Let's see the distribution of the Category column.

In [6]:
df['Category'].value_counts()

Category,count
ham,4825
spam,747


We can also express the distribution as percentages.

In [7]:
df['Category'].value_counts()/len(df)*100.0

Category,count
ham,86.593683
spam,13.406317
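Because the classes are imbalanced, it helps to know what a trivial classifier would score. The short calculation below uses the counts reported above. The variable names are chosen for this example.

```python
# Class counts taken from the value_counts() output above
ham_count, spam_count = 4825, 747
total = ham_count + spam_count

# Accuracy of a model that always predicts the majority class ("ham")
majority_baseline = max(ham_count, spam_count) / total

print(f"Always predicting 'ham' is right {majority_baseline:.1%} of the time")
```

Any classifier trained on this data should be judged against that baseline and on per-class precision and recall, not on accuracy alone.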


In [8]:
# Encode the label as an integer: 1 for spam, 0 for ham
df['spam'] = df['Category'].apply(lambda x: 1 if x == 'spam' else 0)

In [9]:
df.head()

,Category,Message,spam
0,ham,"Go until jurong point, crazy.. Available only ...",0
1,ham,Ok lar... Joking wif u oni...,0
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,1
3,ham,U dun say so early hor... U c already then say...,0
4,ham,"Nah I don't think he goes to usf, he lives aro...",0


In [10]:
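# Alternative encoding: reload the data and convert the Category column itself
new_df = pd.read_csv('/content/spam.csv')

In [13]:
# Assign the mapped labels back to the column; calling replace(..., inplace=True)
# on a column selection triggers chained-assignment warnings in recent pandas versions
new_df['Category'] = new_df['Category'].map({'ham': 0, 'spam': 1})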
new_df.head(5)

,Category,Message
0,0,"Go until jurong point, crazy.. Available only ..."
1,0,Ok lar... Joking wif u oni...
2,1,Free entry in 2 a wkly comp to win FA Cup fina...
3,0,U dun say so early hor... U c already then say...
4,0,"Nah I don't think he goes to usf, he lives aro..."


## Train Test Split

In [15]:
# Hold out 25% of the messages for testing
X_train, X_test, y_train, y_test = train_test_split(df.Message, df.spam, test_size=0.25)

In [16]:
X_train.shape

(4179,)

In [17]:
X_test.shape

(1393,)
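The split above is random, so the exact numbers in later cells change from run to run. One option, shown here as a sketch on made-up data, is to pass `random_state` for reproducibility and `stratify` to keep the spam ratio the same in both parts. The toy `messages` and `labels` lists and the seed value 42 are assumptions for the example.

```python
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 20 messages, 4 of them spam (label 1)
messages = [f"message {i}" for i in range(20)]
labels = [1] * 4 + [0] * 16

X_tr, X_te, y_tr, y_te = train_test_split(
    messages,
    labels,
    test_size=0.25,    # same proportion as in the notebook
    random_state=42,   # fixed seed so the split is repeatable
    stratify=labels,   # preserve the 20% spam ratio in both parts
)

print(len(X_tr), len(X_te))  # 15 5
print(sum(y_tr), sum(y_te))  # 3 1
```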

Creating a BoW representation using CountVectorizer

In [19]:
v = CountVectorizer()

Purpose: creates a vectorizer that converts text into word-count vectors.
* Default behavior: converts text to lowercase, ignores punctuation, and splits text into word tokens of two or more characters, so single-character tokens such as "a" or "u" are dropped.
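To see what these defaults do to a message, the sketch below mimics them with the standard `re` module, using the regular expression that scikit-learn documents as the default `token_pattern`. The helper name `default_like_tokenize` and the sample sentence are made up for this example.

```python
import re

# The pattern scikit-learn documents as CountVectorizer's default token_pattern:
# runs of two or more word characters
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"

def default_like_tokenize(text):
    # Lowercase first, then extract the tokens
    return re.findall(DEFAULT_TOKEN_PATTERN, text.lower())

tokens = default_like_tokenize("Free entry!! Win a PRIZE, u r selected.")
print(tokens)  # ['free', 'entry', 'win', 'prize', 'selected']
```

The single-character words "a", "u", and "r" disappear. This matters for SMS text, where such abbreviations are common.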

In [20]:
X_train_cv = v.fit_transform(X_train.values)

fit_transform() does two things:
* fit(): learns the vocabulary from X_train (identifies all unique words)
* transform(): converts each message into a vector of word counts

In [21]:
X_test_cv = v.transform(X_test.values)

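This applies the vocabulary learned from the training set to the test set.
* Calling transform() rather than fit_transform() keeps the test columns aligned with the training columns and avoids data leakage, because nothing is learned from the test messages. Words that never appeared in training are ignored.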

In [23]:
# Dense count vector of the first training message
X_train_cv.toarray()[0]

array([0, 0, 0, ..., 0, 0, 0])

In [24]:
X_train_cv.shape

(4179, 7497)

Here we have 4179 training messages and 7497 unique tokens in the vocabulary.
* The output is a sparse matrix: most entries are 0 because each SMS uses only a few of the vocabulary words.
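How sparse the matrix is can be measured from its `nnz` attribute (the number of stored non-zero entries). The sketch below does this on a toy corpus. The documents and the names `counts` and `density` are assumptions for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (hypothetical)
docs = ["free prize now", "call me now", "see you at lunch"]

counts = CountVectorizer().fit_transform(docs)  # SciPy sparse matrix

n_rows, n_cols = counts.shape
density = counts.nnz / (n_rows * n_cols)  # fraction of cells that are non-zero

print(counts.shape)  # (3, 9)
print(counts.nnz)    # 10
print(f"{density:.0%} of the cells are non-zero")
```

The same expression applied to `X_train_cv` would give the density of the real training matrix. With thousands of columns and short messages, it should be far lower than in this toy case.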

In [25]:
v.get_feature_names_out()

array(['00', '000', '000pes', ..., 'èn', 'ú1', '〨ud'], dtype=object)

This returns all unique tokens in the vocabulary (the features), in column order.

In [26]:
v.vocabulary_

{'future': 2937,
 'is': 3623,
 'not': 4682,
 'what': 7233,
 'we': 7181,
 'planned': 5073,
 'for': 2827,
 'tomorrow': 6742,
 'it': 3634,
 'the': 6601,
 'result': 5574,
 'of': 4745,
 'do': 2278,
 'today': 6723,
 'best': 1257,
 'in': 3520,
 'present': 5218,
 'enjoy': 2500,
 'thank': 6590,
 'you': 7462,
 'like': 3977,
 'as': 1031,
 'well': 7216,
 'lol': 4043,
 'have': 3239,
 'to': 6716,
 'take': 6480,
 'member': 4292,
 'how': 3399,
 'said': 5696,
 'my': 4517,
 'aunt': 1089,
 'flow': 2794,
 'didn': 2213,
 'visit': 7075,
 'months': 4430,
 'cause': 1615,
 'developed': 2192,
 'ovarian': 4870,
 'cysts': 2058,
 'bc': 1198,
 'only': 4797,
 'way': 7177,
 'shrink': 5942,
 'them': 6611,
 'motivate': 4447,
 'behind': 1240,
 'every': 2563,
 'darkness': 2080,
 'there': 6619,
 'shining': 5894,
 'light': 3974,
 'waiting': 7125,
 'find': 2741,
 'friend': 2891,
 'always': 877,
 'trust': 6833,
 'and': 903,
 'love': 4082,
 'bslvyl': 1461,
 'yo': 7458,
 'guess': 3153,
 'just': 3744,
 'dropped': 2368,
 'awesom

A dictionary mapping each word to its column index in the BoW matrix.

In [27]:
X_train_np = X_train_cv.toarray()
X_train_np[0]

array([0, 0, 0, ..., 0, 0, 0])

In [28]:
np.where(X_train_np[0] != 0)

(array([1257, 2278, 2500, 2827, 2937, 3520, 3623, 3634, 4682, 4745, 5073,
        5218, 5574, 6601, 6723, 6742, 7181, 7233]),)

Naive Bayes Classifier

In [30]:
# Multinomial Naive Bayes works directly on the word-count features
model = MultinomialNB()
model.fit(X_train_cv, y_train)

In [31]:
y_pred = model.predict(X_test_cv)

In [34]:
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.98      1.00      0.99      1190
           1       0.98      0.91      0.94       203

    accuracy                           0.98      1393
   macro avg       0.98      0.95      0.97      1393
weighted avg       0.98      0.98      0.98      1393
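In the report above, recall for class 1 is 0.91, meaning roughly 9% of the spam messages in the test set were labelled as ham. To classify new messages, the vectorizer and the model must be applied in the same order as during training. One way to keep them together is a scikit-learn pipeline, sketched below on made-up data. The training messages, labels, and variable names are assumptions for illustration, not the notebook's dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data (hypothetical): 1 = spam, 0 = ham
train_msgs = [
    "win a free prize now",
    "free entry win cash",
    "claim your free prize",
    "are you coming to lunch",
    "see you at home tonight",
    "call me when you are home",
]
train_labels = [1, 1, 1, 0, 0, 0]

# The pipeline vectorizes the text and then classifies it in one call
clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(train_msgs, train_labels)

new_msgs = ["free cash prize", "see you at lunch"]
predictions = clf.predict(new_msgs)

print(predictions)  # [1 0]
```

With a pipeline, raw strings go in and labels come out, so there is no separate `transform` step to forget when scoring new text.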

