## Dataset

The SMS Spam Collection is a set of tagged SMS messages collected for SMS spam research. It contains 5,574 English SMS messages, each tagged as either ham (legitimate) or spam.

Link: https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset

# Tools & Data

In [33]:
import pandas as pd
import numpy as np

In [35]:
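# 'ansi' is a Windows-only codec alias; on other platforms use an explicit codec such as 'cp1252' or 'latin-1'
df = pd.read_csv("spam.csv", encoding='ansi')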
df.head()

Unnamed: 0,Category,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [36]:
df.Category.value_counts()

ham     4825
spam     747
Name: Category, dtype: int64
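
The counts above show that the classes are imbalanced. As a quick sanity check, the sketch below turns them into shares and computes the accuracy of a trivial classifier that always answers "ham". The numbers are copied from the output above; the variable names are only illustrative.

```python
# Class counts copied from the value_counts() output above.
counts = {"ham": 4825, "spam": 747}
total = sum(counts.values())

# Share of each class in the dataset.
for label, n in counts.items():
    print(f"{label}: {n / total:.1%}")

# Accuracy of a trivial classifier that labels every message as ham.
baseline_accuracy = counts["ham"] / total
print(f"always-ham baseline accuracy: {baseline_accuracy:.3f}")
```

Any accuracy figure for a real model is best read against this baseline, which is why per-class precision and recall are more informative here.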

In [66]:
# Let's create a new column for numerical classification

In [37]:
df['spam'] = df['Category'].apply(lambda x: 1 if x =='spam' else 0)

# apply() calls the lambda once for each value in the "Category" column.
# Here the lambda returns 1 when the value is 'spam' and 0 otherwise.
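
The same 0/1 column can also be built without a lambda by comparing the whole column at once. This is a minimal sketch on a made-up four-row frame that reuses the `Category` column name; it is an alternative, not a correction of the cell above.

```python
import pandas as pd

# Toy frame with the same column name as the notebook (the rows are made up).
toy = pd.DataFrame({"Category": ["ham", "spam", "ham", "spam"]})

# Row-by-row version, as in the cell above.
via_apply = toy["Category"].apply(lambda x: 1 if x == "spam" else 0)

# Vectorized version: compare the whole column, then cast booleans to integers.
via_compare = (toy["Category"] == "spam").astype(int)

print(via_apply.tolist())    # [0, 1, 0, 1]
print(via_compare.tolist())  # [0, 1, 0, 1]
```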

In [38]:
df.shape

(5572, 3)

In [39]:
df.head()

Unnamed: 0,Category,Message,spam
0,ham,"Go until jurong point, crazy.. Available only ...",0
1,ham,Ok lar... Joking wif u oni...,0
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,1
3,ham,U dun say so early hor... U c already then say...,0
4,ham,"Nah I don't think he goes to usf, he lives aro...",0


## Define the X and the y to prepare the construction of the model

In [40]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df.Message, df.spam, test_size=0.2)

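# X is the message text and y is the 0/1 spam label
# 20% of the rows are held out for testing
# No random_state is set, so the split (and the scores below) will change from run to run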
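
If a reproducible split is wanted, `train_test_split` accepts `random_state`, and `stratify` keeps the spam ratio the same in both parts. The sketch below uses made-up toy data (10 messages, 2 spam) and an arbitrary seed of 42; neither comes from the notebook.

```python
from sklearn.model_selection import train_test_split

# Made-up toy data: 10 messages, 2 of them spam.
messages = [f"message {i}" for i in range(10)]
labels = [0, 0, 0, 0, 1, 0, 0, 0, 0, 1]

X_tr, X_te, y_tr, y_te = train_test_split(
    messages,
    labels,
    test_size=0.5,
    random_state=42,   # fixed seed: the same split on every run
    stratify=labels,   # keep the spam ratio equal in both halves
)

print(len(X_tr), len(X_te))  # size of each half
print(sum(y_tr), sum(y_te))  # number of spam messages in each half
```

With stratification, each half of this toy set receives exactly one of the two spam messages.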

###  Exploring X & y

In [41]:
X_test.shape

(1115,)

In [42]:
X_train.shape

(4457,)

In [43]:
# Our first 4 messages and their index labels
X_train[:4]

1257          Am also doing in cbe only. But have to pay.
21      I‰Û÷m going to try for 2 months ha ha only joking
2592    My friend just got here and says he's upping h...
69                     I plane to give on this month end.
Name: Message, dtype: object

In [44]:
y_train[:4]

# None of them is spam!

1257    0
21      0
2592    0
69      0
Name: spam, dtype: int64

# Converting words into vectors with bag-of-words (BOW)

In [45]:
from sklearn.feature_extraction.text import CountVectorizer

v = CountVectorizer()

# We start by converting our X_train

X_train_cv = v.fit_transform(X_train.values)
X_train_cv

<4457x7702 sparse matrix of type '<class 'numpy.int64'>'
	with 59330 stored elements in Compressed Sparse Row format>

In [67]:
# We change it into an array

X_train_cv.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [68]:
# Let's look at the first 2 samples

X_train_cv.toarray()[:2]

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [47]:
X_train_cv.shape

(4457, 7702)

## Exploring the vocabulary

In [48]:
# get_feature_names_out() returns the vocabulary words, ordered by column index

v.get_feature_names_out()[1771]

'children'

In [49]:
# vocabulary_ maps every word in the vocabulary to its column index

v.vocabulary_

{'am': 934,
 'also': 925,
 'doing': 2373,
 'in': 3617,
 'cbe': 1679,
 'only': 4908,
 'but': 1562,
 'have': 3334,
 'to': 6878,
 'pay': 5089,
 'going': 3148,
 'try': 7003,
 'for': 2918,
 'months': 4551,
 'ha': 3270,
 'joking': 3826,
 'my': 4638,
 'friend': 2981,
 'just': 3858,
 'got': 3176,
 'here': 3385,
 'and': 962,
 'says': 5890,
 'he': 3342,
 'upping': 7141,
 'his': 3415,
 'order': 4948,
 'by': 1576,
 'few': 2804,
 'grams': 3197,
 'lt': 4225,
 'gt': 3241,
 'when': 7416,
 'can': 1618,
 'you': 7646,
 'get': 3092,
 'plane': 5206,
 'give': 3120,
 'on': 4897,
 'this': 6804,
 'month': 4548,
 'end': 2562,
 'say': 5888,
 'leh': 4049,
 'of': 4857,
 'course': 2020,
 'nothing': 4803,
 'happen': 3305,
 'lar': 3994,
 'not': 4799,
 'romantic': 5774,
 'jus': 3857,
 'bit': 1359,
 'lor': 4181,
 'thk': 6805,
 'nite': 4756,
 'scenery': 5898,
 'so': 6241,
 'nice': 4739,
 'lag': 3974,
 'that': 6760,
 'the': 6766,
 'sad': 5834,
 'part': 5058,
 'we': 7352,
 'keep': 3888,
 'touch': 6939,
 'thanks': 6752,
 ...

In [70]:
# What else can you do with CountVectorizer

dir(v)

['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setstate__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_char_ngrams',
 '_char_wb_ngrams',
 '_check_feature_names',
 '_check_n_features',
 '_check_stop_words_consistency',
 '_check_vocabulary',
 '_count_vocab',
 '_get_param_names',
 '_get_tags',
 '_limit_features',
 '_more_tags',
 '_repr_html_',
 '_repr_html_inner',
 '_repr_mimebundle_',
 '_sort_features',
 '_stop_words_id',
 '_validate_data',
 '_validate_params',
 '_validate_vocabulary',
 '_warn_for_unused_params',
 '_white_spaces',
 '_word_ngrams',
 'analyzer',
 'binary',
 'build_analyzer',
 'build_preprocessor',
 'build_tokenizer',
 'decode',
 'decode_error',
 'dtype',
 'encoding',
 'fit',

In [50]:
# Let's store the dense array in a variable
# Now let's look at our first message (mostly zeros, as expected for a bag-of-words vector)
X_train_np = X_train_cv.toarray()
X_train_np[0]

array([0, 0, 0, ..., 0, 0, 0], dtype=int64)

In [51]:
# Let's look at the positions whose values are not 0

np.where(X_train_np[0]!=0)

(array([ 925,  934, 1562, 1679, 2373, 3334, 3617, 4908, 5089, 6878],
       dtype=int64),)

In [72]:
# The first message in X_train has the index label 1257
# Let's explore it

X_train[:4]

1257          Am also doing in cbe only. But have to pay.
21      I‰Û÷m going to try for 2 months ha ha only joking
2592    My friend just got here and says he's upping h...
69                     I plane to give on this month end.
Name: Message, dtype: object

In [73]:
X_train[:4][1257]

'Am also doing in cbe only. But have to pay.'

In [74]:
# We saw that this vector has nonzero values at these positions:
#([ 925,  934, 1562, 1679, 2373, 3334, 3617, 4908, 5089, 6878])

X_train_np[0][6878]

1

In [76]:
# Which word does column 6878 correspond to?

v.get_feature_names_out()[6878]

# It appears once in the message: "Am also doing in cbe only. But have to pay."

'to'

# Building the model

In [57]:
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB()
model.fit(X_train_cv, y_train)

MultinomialNB()
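
To give an idea of what `fit` learns, the sketch below trains `MultinomialNB` on a made-up four-message corpus and recomputes the smoothed word probabilities for the spam class by hand: (count of the word in spam + alpha) / (total spam word count + alpha x vocabulary size), with the default `alpha=1.0`. The toy texts and variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up toy corpus: two spam (1) and two ham (0) messages.
texts = ["win free prize now", "free cash win", "see you at lunch", "lunch at noon then"]
labels = np.array([1, 1, 0, 0])

vec = CountVectorizer()
X = vec.fit_transform(texts)
nb = MultinomialNB(alpha=1.0).fit(X, labels)

# Manual Laplace-smoothed estimate of P(word | spam).
spam_counts = np.asarray(X[labels == 1].sum(axis=0)).ravel()
manual = (spam_counts + 1.0) / (spam_counts.sum() + 1.0 * X.shape[1])

# The fitted model stores the same quantity as log-probabilities.
spam_row = list(nb.classes_).index(1)
learned = np.exp(nb.feature_log_prob_[spam_row])
print(np.allclose(manual, learned))  # True

# New messages must go through the same fitted vectorizer.
new = vec.transform(["free prize", "lunch at noon"])
print(nb.predict(new))  # [1 0]
```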

# Testing

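We also need to convert X_test, using the vectorizer already fitted on the training data (`transform`, not `fit_transform`), before we can measure how well the model performs.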

In [59]:
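# Use a regular variable: X_test.cv would set an attribute on the Series object
X_test_cv = v.transform(X_test)

In [63]:
from sklearn.metrics import classification_report

y_pred = model.predict(X_test_cv)

print(classification_report(y_test, y_pred))  # arguments: true labels, then predictions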

              precision    recall  f1-score   support

           0       0.99      1.00      1.00       976
           1       1.00      0.95      0.97       139

    accuracy                           0.99      1115
   macro avg       1.00      0.97      0.99      1115
weighted avg       0.99      0.99      0.99      1115
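
The precision and recall columns of the report come straight from confusion-matrix counts. The sketch below uses made-up labels (not the notebook's predictions) in which the classifier never flags ham as spam but misses one spam message, the same pattern as in the report above: perfect precision for class 1 and lower recall.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels: 6 ham (0) and 4 spam (1); one spam message is missed.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_hat = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]

# For binary labels, ravel() yields the counts in this order.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)  # of the messages flagged as spam, how many were spam
recall = tp / (tp + fn)     # of the real spam messages, how many were caught

print(tn, fp, fn, tp)       # 6 0 1 3
print(precision, recall)    # 1.0 0.75
```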



In [64]:
# Let's try it !

emails = [
    "Hi Asma, you did a very good job at the presentation yesterday, talk to you soon",
    "More than 50% discount to your next purchase! go visit our web site to know more and how you can win more than 1 million dollar in our lottery"
]

In [65]:
# Transformation and Prediction

email_count = v.transform(emails)
model.predict(email_count)

array([0, 1], dtype=int64)

You can see that the model classified the first message as not spam (0) and the second as spam (1).
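
The vectorize-then-predict steps could also be bundled into a scikit-learn `Pipeline`, so that raw text goes in and labels come out without calling `transform` by hand. This is a sketch of that option on a made-up four-message corpus, not the approach used in the notebook; the step names `vectorizer` and `nb` are arbitrary.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up toy corpus: two spam (1) and two ham (0) messages.
train_texts = [
    "claim your free lottery prize",
    "win cash now, claim today",
    "are we still meeting tomorrow",
    "call me when you get home",
]
train_labels = [1, 1, 0, 0]

# The pipeline fits the vectorizer and the classifier together.
clf = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("nb", MultinomialNB()),
])
clf.fit(train_texts, train_labels)

# Raw strings can be passed directly; words unseen in training are ignored.
preds = clf.predict(["claim your prize today", "see you tomorrow at home"])
print(preds)  # [1 0]
```

Bundling the steps also guards against accidentally refitting the vectorizer on test data.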