# **SPAM FILTER USING NAIVE BAYES**

Title: E-mail spam filtering using the Naive Bayes algorithm

Author: Garvit Budhiraja

Dataset: https://www.kaggle.com/code/mfaisalqureshi/email-spam-detection-98-accuracy

Reference: https://www.youtube.com/watch?v=2sXAYoPIz3A

In [1]:
# import data
import pandas as pd
spam_df = pd.read_csv("spam.csv")

In [2]:
# inspect data: per-category count, unique messages, most frequent message and its frequency
spam_df.groupby("Category").describe()

Category,Message count,Message unique,Message top,Message freq
ham,4825,4516,"Sorry, I'll call later",30
spam,747,641,Please call our customer service representativ...,4
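The counts above show that the two classes are imbalanced. As a rough sketch using only the numbers reported by `describe()` (4825 ham, 747 spam), the snippet below works out the spam share and the accuracy that a trivial "always predict ham" baseline would reach. The variable names are illustrative and not part of the original notebook.

```python
# counts taken from the describe() output above
ham_count = 4825
spam_count = 747
total = ham_count + spam_count

# fraction of all messages that are spam
spam_share = spam_count / total

# accuracy of a hypothetical baseline that labels every message as ham
baseline_accuracy = ham_count / total

print(f"total messages: {total}")
print(f"spam share: {spam_share:.3f}")
print(f"always-ham baseline accuracy: {baseline_accuracy:.3f}")
```

Any accuracy reported for the trained model is best read against this baseline rather than against 50%.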


In [3]:
# create a new column "spam" which stores 1 if the message is spam and 0 if it is ham
spam_df["spam"] = (spam_df["Category"] == "spam").astype(int)

In [4]:
spam_df

,Category,Message,spam
0,ham,"Go until jurong point, crazy.. Available only ...",0
1,ham,Ok lar... Joking wif u oni...,0
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,1
3,ham,U dun say so early hor... U c already then say...,0
4,ham,"Nah I don't think he goes to usf, he lives aro...",0
...,...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...,1
5568,ham,Will ü b going to esplanade fr home?,0
5569,ham,"Pity, * was in mood for that. So...any other s...",0
5570,ham,The guy did some bitching but I acted like i'd...,0


In [5]:
# create train/test split (80% train, 20% test)
# no random_state is set, so the split and the scores below vary from run to run
from sklearn.model_selection import train_test_split as tts
x_train, x_test, y_train, y_test = tts(spam_df["Message"], spam_df["spam"], test_size=0.20)

In [6]:
x_train.describe()

count                       4457
unique                      4179
top       Sorry, I'll call later
freq                          21
Name: Message, dtype: object

In [7]:
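# count word occurrences per message and store them as a sparse document-term matrix
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
x_train_count = cv.fit_transform(x_train.values)
# dense view for display only; most entries are 0 because each message uses few of the vocabulary words
x_train_count.toarray()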

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])
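To make the matrix above less opaque, here is a minimal pure-Python sketch of the bag-of-words idea behind `CountVectorizer`. It mirrors the default behaviour (lowercasing, tokens of two or more word characters, alphabetically ordered vocabulary) but is not scikit-learn's implementation; the helper names `tokenize`, `fit_vocabulary` and `transform`, and the toy messages, are made up for illustration.

```python
import re
from collections import Counter

# tokens are runs of 2+ word characters, similar to CountVectorizer's default rule
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")

def tokenize(text):
    """Lowercase the text and split it into tokens."""
    return TOKEN_PATTERN.findall(text.lower())

def fit_vocabulary(docs):
    """Map each distinct token to a column index, in alphabetical order."""
    tokens = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {tok: idx for idx, tok in enumerate(tokens)}

def transform(docs, vocabulary):
    """Return one row of word counts per document; unseen words are ignored."""
    rows = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        rows.append([counts.get(tok, 0) for tok in vocabulary])
    return rows

# toy messages for illustration only
toy_docs = ["Free entry, free prize", "Call me later"]
vocab = fit_vocabulary(toy_docs)
matrix = transform(toy_docs, vocab)
print(vocab)
print(matrix)
```

Each row corresponds to one message and each column to one vocabulary word, which is why the real matrix is almost entirely zeros.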

In [8]:
# train a multinomial Naive Bayes classifier on the word counts
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB()
model.fit(x_train_count, y_train)
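For intuition about what `model.fit` learns, the following is a hand-rolled sketch of multinomial Naive Bayes with add-one (Laplace) smoothing, which corresponds to the default `alpha=1.0` in `MultinomialNB`. It is a simplified illustration on whitespace-split toy messages, not scikit-learn's code; `train_nb`, `predict_nb` and the toy data are hypothetical.

```python
import math
from collections import Counter

def train_nb(docs, labels, alpha=1.0):
    """Fit class log-priors and smoothed per-class word log-probabilities."""
    vocab = sorted({w for doc in docs for w in doc.split()})
    log_prior, log_likelihood = {}, {}
    for c in sorted(set(labels)):
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        log_prior[c] = math.log(len(class_docs) / len(docs))
        counts = Counter(w for d in class_docs for w in d.split())
        # smoothing adds alpha to every word count so unseen words never get probability 0
        total = sum(counts.values()) + alpha * len(vocab)
        log_likelihood[c] = {w: math.log((counts[w] + alpha) / total) for w in vocab}
    return log_prior, log_likelihood

def predict_nb(doc, log_prior, log_likelihood):
    """Return the class with the highest log posterior; out-of-vocabulary words are skipped."""
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][w] for w in doc.split() if w in log_likelihood[c]
        )
    return max(scores, key=scores.get)

# toy data for illustration only (1 = spam, 0 = ham)
toy_docs = ["win money now", "free money win", "call me later", "see you later"]
toy_labels = [1, 1, 0, 0]
log_prior, log_likelihood = train_nb(toy_docs, toy_labels)
print(predict_nb("win free money", log_prior, log_likelihood))
print(predict_nb("call you later", log_prior, log_likelihood))
```

Working in log space avoids numerical underflow when many small word probabilities are multiplied together.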

In [9]:
# sanity check: messages we expect to be classified as ham (0)
email_ham = ["Hi bro, how are you doing?", "Yo broooo"]
email_ham_count = cv.transform(email_ham)
model.predict(email_ham_count)

array([0, 0])

In [10]:
# sanity check: messages we expect to be classified as spam (1)
email_spam = ["reward money click", "click on the link to win $10000"]
email_spam_count = cv.transform(email_spam)
model.predict(email_spam_count)

array([1, 1])
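The two checks above each call `cv.transform` before `model.predict`. One possible way to avoid repeating that step is scikit-learn's `make_pipeline`, which chains the vectorizer and the classifier into a single estimator. The sketch below uses a handful of invented messages instead of `spam.csv`, so the names `toy_messages`, `toy_labels` and `spam_pipeline` are assumptions for illustration only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# toy data for illustration only (1 = spam, 0 = ham)
toy_messages = [
    "win free money now",
    "click the link to claim your reward",
    "are we still meeting for lunch",
    "call me when you get home",
]
toy_labels = [1, 1, 0, 0]

# the pipeline vectorizes and classifies in a single fit/predict call
spam_pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
spam_pipeline.fit(toy_messages, toy_labels)

# raw strings go in directly; no separate transform step is needed
predictions = spam_pipeline.predict(["claim your free reward", "see you at lunch"])
print(predictions)
```

A pipeline also makes it harder to accidentally fit the vectorizer on test data, since fitting and transforming are tied to the same object.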

In [11]:
# evaluate accuracy on the held-out test set
x_test_count = cv.transform(x_test)
model.score(x_test_count, y_test)

0.9910313901345291
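Because ham makes up most of the dataset (4825 of 5572 messages), accuracy alone can hide missed spam. The sketch below shows how precision and recall for the spam class could be computed from true and predicted labels. The labels here are invented toy values, not results from this notebook, and `confusion_counts` is a hypothetical helper.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn

# toy labels for illustration only (1 = spam, 0 = ham)
y_true_toy = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred_toy = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]

tp, fp, fn, tn = confusion_counts(y_true_toy, y_pred_toy)
precision = tp / (tp + fp)  # of messages flagged as spam, how many really are spam
recall = tp / (tp + fn)     # of real spam messages, how many were caught
accuracy = (tp + tn) / len(y_true_toy)

print(f"precision={precision:.2f} recall={recall:.2f} accuracy={accuracy:.2f}")
```

In the notebook itself, the same quantities could be obtained by passing `y_test` and `model.predict(x_test_count)` to `sklearn.metrics.classification_report`.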