### Vectorization
- Vectorization is the process of converting text data into numerical representations so that machine learning models can process and analyze it. This transformation is essential because learning algorithms need numerical input to perform computations and learn patterns from the data.
- Bag of Words (BoW): Represents text by counting the frequency of each word in a document, ignoring grammar and word order.
- TF-IDF (Term Frequency-Inverse Document Frequency): Weights each word's frequency by how rare the word is across the corpus, highlighting distinctive terms.
- Word Embeddings (e.g., Word2Vec, GloVe): Map words to dense vectors in a continuous vector space, capturing semantic relationships between words.
- Contextualized Embeddings (e.g., BERT): Generate word representations based on the context in which each word appears, allowing for a more nuanced understanding.
- `!pip install pandas`
- `!pip install scikit-learn`
- Scikit-learn (also known as sklearn) is a free and open-source machine learning library for Python. It offers a wide range of tools for data preprocessing, model selection, and evaluation, which makes it a common choice for both beginners and professionals in machine learning.

In [9]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

In [11]:
df = pd.read_csv('../data/spam.csv')

In [12]:
df.head(5)

| | Category | Message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [13]:
df.Category.value_counts()

Category
ham     4825
spam     747
Name: count, dtype: int64

In [14]:
df.Category.value_counts()/len(df)*100

Category
ham     86.593683
spam    13.406317
Name: count, dtype: float64
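
The percentages above show that the classes are imbalanced: a classifier that always predicts "ham" would already be right about 86.6% of the time. As a small aside, the same percentages can be obtained without dividing by `len(df)`. The sketch below uses a made-up four-row label column (an assumption for illustration) rather than the real dataset.

```python
import pandas as pd

# Hypothetical miniature label column; the notebook itself uses df.Category.
labels = pd.Series(["ham", "ham", "ham", "spam"], name="Category")

# normalize=True returns fractions directly, so no division by len() is needed.
percentages = labels.value_counts(normalize=True) * 100
print(percentages)
```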

In [15]:
df['spam'] = df['Category'].apply(lambda x: 1 if x=='spam' else 0)
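
The `apply` with a lambda works, but the same label column can also be built with a vectorized comparison. This is a sketch on a made-up three-row frame (the frame `toy` and its column names are assumptions for illustration); both approaches should give identical results.

```python
import pandas as pd

# Hypothetical toy frame standing in for the spam dataset.
toy = pd.DataFrame({"Category": ["ham", "spam", "ham"]})

# Row-by-row version, as in the cell above.
toy["spam_apply"] = toy["Category"].apply(lambda x: 1 if x == "spam" else 0)

# Vectorized version: compare the whole column, then cast booleans to 0/1.
toy["spam_vectorized"] = (toy["Category"] == "spam").astype(int)

print(toy)
```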

In [16]:
df.head(5)

| | Category | Message | spam |
|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... | 0 |
| 1 | ham | Ok lar... Joking wif u oni... | 0 |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | 1 |
| 3 | ham | U dun say so early hor... U c already then say... | 0 |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | 0 |


In [21]:
new_df = pd.read_csv('../data/spam.csv')

In [22]:
new_df.head(5)

| | Category | Message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [23]:
df.head(5)

| | Category | Message | spam |
|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... | 0 |
| 1 | ham | Ok lar... Joking wif u oni... | 0 |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | 1 |
| 3 | ham | U dun say so early hor... U c already then say... | 0 |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | 0 |


In [24]:
df.shape

(5572, 3)

### Train test split

In [25]:
X_train, X_test, y_train, y_test = train_test_split(df.Message, df.spam, test_size=0.2)
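
Because the split above is random, the rows that end up in each set (and the scores later on) will differ from run to run. One possible variation, sketched here on made-up data, is to pass `random_state` for reproducibility and `stratify` so both sets keep the same ham/spam ratio. The toy frame and the value 42 are assumptions for illustration; the notebook itself uses neither option.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 8 ham (0) and 2 spam (1) messages.
toy = pd.DataFrame({
    "Message": [f"message {i}" for i in range(10)],
    "spam": [0] * 8 + [1] * 2,
})

# random_state fixes the shuffle; stratify preserves the class ratio.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy.Message, toy.spam, test_size=0.5, random_state=42, stratify=toy.spam
)

print("train spam count:", y_tr.sum())
print("test spam count: ", y_te.sum())
```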

In [26]:
X_train.shape

(4457,)

In [27]:
X_test.shape

(1115,)

In [28]:
X_train[:4]

2975           I'll text carlos and let you know, hang on
293                                 Oops. 4 got that bit.
628     Yup i thk they r e teacher said that will make...
2868    Mum, i've sent you many many messages since i ...
Name: Message, dtype: object

In [29]:
y_train[:4]

2975    0
293     0
628     0
2868    0
Name: spam, dtype: int64

#### **Create bag of words representation using CountVectorizer**

In [30]:
v = CountVectorizer()

X_train_cv = v.fit_transform(X_train.values)
X_test_cv = v.transform(X_test)

In [31]:
X_train_cv

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 59352 stored elements and shape (4457, 7765)>

In [32]:
X_train_cv.toarray()[:2][0]

array([0, 0, 0, ..., 0, 0, 0], shape=(7765,))

In [33]:
X_train_cv.shape

(4457, 7765)

In [34]:
v.get_feature_names_out()[1771]

'cheers'

In [35]:
v.vocabulary_

{'ll': 4161,
 'text': 6810,
 'carlos': 1672,
 'and': 970,
 'let': 4090,
 'you': 7730,
 'know': 3962,
 'hang': 3325,
 'on': 4945,
 'oops': 4968,
 'got': 3205,
 'that': 6833,
 'bit': 1379,
 'yup': 7751,
 'thk': 6880,
 'they': 6863,
 'teacher': 6757,
 'said': 5884,
 'will': 7549,
 'make': 4332,
 'my': 4675,
 'face': 2752,
 'look': 4198,
 'longer': 4195,
 'darren': 2173,
 'ask': 1104,
 'me': 4415,
 'not': 4844,
 'cut': 2136,
 'too': 7002,
 'short': 6120,
 'mum': 4650,
 've': 7288,
 'sent': 6028,
 'many': 4356,
 'messages': 4467,
 'since': 6179,
 'here': 3418,
 'just': 3880,
 'want': 7407,
 'to': 6960,
 'are': 1061,
 'actually': 819,
 'getting': 3129,
 'them': 6846,
 'do': 2380,
 'enjoy': 2609,
 'the': 6836,
 'rest': 5756,
 'of': 4907,
 'your': 7733,
 'day': 2186,
 'aight': 892,
 'sorry': 6337,
 'take': 6710,
 'ten': 6789,
 'years': 7703,
 'shower': 6136,
 'what': 7505,
 'plan': 5243,
 'yes': 7713,
 'but': 1584,
 'dont': 2416,
 'care': 1661,
 'need': 4738,
 'bad': 1207,
 'princess': 5423,
 

In [36]:
X_train_np = X_train_cv.toarray()
X_train_np[0]

array([0, 0, 0, ..., 0, 0, 0], shape=(7765,))

In [37]:
np.where(X_train_np[0]!=0)

(array([ 970, 1672, 3325, 3962, 4090, 4161, 4945, 6810, 7730]),)

#### **Naive Bayes Classifier**

In [38]:
model = MultinomialNB()
model.fit(X_train_cv, y_train)

MultinomialNB()
Parameters: alpha=1.0, force_alpha=True, fit_prior=True, class_prior=None

In [39]:
y_pred = model.predict(X_test_cv)

In [40]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.99      1.00      0.99       969
           1       0.98      0.95      0.96       146

    accuracy                           0.99      1115
   macro avg       0.99      0.97      0.98      1115
weighted avg       0.99      0.99      0.99      1115
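
In the report above, a recall of 0.95 for class 1 means that roughly 5% of the 146 spam messages in the test set were missed. Precision and recall come straight from the confusion matrix, which the following sketch shows on eight made-up labels (the label lists are assumptions for illustration, not the notebook's predictions).

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical true and predicted labels: 1 = spam, 0 = ham.
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_hat = [0, 0, 0, 1, 1, 1, 0, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)  # of the messages flagged as spam, how many were spam
recall = tp / (tp + fn)     # of the actual spam messages, how many were flagged

print(precision, precision_score(y_true, y_hat))
print(recall, recall_score(y_true, y_hat))
```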



#### **Test on a random datapoint**

In [41]:
# transform expects an iterable of documents, so wrap the message in a list
message = ["Upto 20% off on parking, exclusing offer just for you"]

In [42]:
message_cnt = v.transform(message)

model.predict(message_cnt)

array([1])
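
Vectorizing and predicting are two separate steps above. One possible way to bundle them is scikit-learn's `Pipeline`, which accepts raw strings directly. The sketch below trains on four made-up messages (the training data, step names, and test messages are assumptions for illustration), so its predictions say nothing about the accuracy of the notebook's model.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy training set: 1 = spam, 0 = ham.
train_msgs = [
    "win a free prize now",
    "free entry exclusive offer",
    "are we still meeting for lunch",
    "call me when you get home",
]
train_labels = [1, 1, 0, 0]

# The pipeline vectorizes the text and then feeds the counts to the classifier.
clf = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("nb", MultinomialNB()),
])
clf.fit(train_msgs, train_labels)

# Raw strings go in as a list; no separate transform call is needed.
new_messages = ["exclusive free offer just for you", "see you at lunch"]
predictions = clf.predict(new_messages)
print(predictions)
```

Passing the messages as a list matters: scikit-learn's text vectorizers expect an iterable of documents and raise an error if given a single bare string.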