
animesh-algorithm/spam-detection


SMS SPAM Detection

Table of Contents

Demo

The finished product can be viewed here

Abstract

  • SPAM SMS Detector is an end-to-end Machine Learning NLP project built with Flask and deployed to the Heroku Cloud Platform. The app predicts whether an SMS you received is legitimate (HAM) or SPAM. The dataset used for this project is taken from Kaggle and has 5,574 records, 749 SPAM and 4,825 HAM; more information is in the "About the Data" section.

  • Of all the ML models tried, Naive Bayes performed best, with an accuracy of 99.01% and a precision of 98.56%. I built several pipelines with a MultinomialNB classifier using both Bag of Words and TF-IDF features, compared them on different classification metrics (accuracy, precision, F1-score, ROC-AUC score), and tuned the hyperparameters. The final output looks something like this -

    (screenshot of the app's final prediction output)

More on that in this section.

  • The final pipeline chosen has the following performance report -
    Metric          Score
    accuracy        99.01
    f1-score        96.14
    precision       98.56
    recall          93.83
    roc auc score   96.81

Motivation

  • As part of the #66daysofdata community created by Ken Jee on YouTube, I did this project to stay consistent in my DS, ML, and DL learning journey and to build a habit of doing Data Science daily.
  • Also, my curiosity about NLP has grown to the point where I feed myself text data and perform some analysis on it daily. I also want to explore what more I can do with NLP.

About the Data

Overview

  • Our dataset originally has 5,574 records, 749 SPAM and 4,825 HAM, as you can see from the distribution below.
    • The dataset is clearly imbalanced, but with Naive Bayes as the model, the imbalance seems to have had no real effect on performance: both accuracy and precision remain high.

Analysis of SPAM messages -

  • SPAM SMSs tend to be longer than legitimate SMSs, which is evident from the graph below.

    • The mean length of SPAM SMSs is close to 140 characters, almost double that of legitimate SMSs, which is close to 71.
  • The most common words used in SPAM messages can be seen from the WordCloud below.

    • Here is the list of the top 10 most used words in SPAM messages -

      Word     Frequency
      call     388
      free     228
      txt      170
      text     145
      mobile   142
      stop     128
      claim    115
      reply    106
      www      98
      prize    97
    • This graph says the same -

  • For a more detailed analysis, you can have a look at my IPython notebook for SPAM SMS EDA and preprocessing.
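The two EDA steps above (comparing message lengths per class and counting spam word frequencies) can be sketched in plain Python. The toy messages below are hypothetical stand-ins for the Kaggle data, so the numbers printed will differ from the real figures:

```python
from collections import Counter
from statistics import mean

# Toy messages standing in for the Kaggle SMS dataset (hypothetical examples)
messages = [
    ("spam", "FREE entry! Call now to claim your prize, txt WIN to 80082"),
    ("spam", "URGENT! Your mobile number has won a free prize, reply to claim"),
    ("ham", "Are we still on for lunch?"),
    ("ham", "ok see you at 5"),
]

# Mean character length per class (in the real data, SPAM ran ~140 chars vs ~71 for HAM)
for label in ("spam", "ham"):
    lengths = [len(text) for lbl, text in messages if lbl == label]
    print(label, round(mean(lengths), 1))

# Word frequencies in SPAM messages, mirroring the top-10 table above
spam_words = Counter(
    word
    for lbl, text in messages
    if lbl == "spam"
    for word in text.lower().split()
)
print(spam_words.most_common(3))
```

On the full dataset the same counting, applied after proper tokenization and stop-word removal, yields the table above.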

Modeling

  • I used Naive Bayes because it

    • works well with a large number of features; after word vectorization, the number of features exceeded 6,000.
    • converges quickly during training.
    • works well with categorical data.
    • is preferable when working with text data.
    • outperformed other models such as SVM and Logistic Regression.
  • The math behind the metrics used

  • I ran GridSearchCV 4 times, each time optimizing for a different metric, and hence constructed 4 different pipelines.

    • Pipeline 1 - Aimed for better Accuracy
    • Pipeline 2 - Aimed for better F1-Score
    • Pipeline 3 - Aimed for better Precision
    • Pipeline 4 - Aimed for better ROC-AUC
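The four searches above could be set up along these lines; the parameter grid and toy data are illustrative, not the exact grid used in the notebook:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

pipe = Pipeline([("tfidf", TfidfVectorizer()), ("clf", MultinomialNB())])
param_grid = {"clf__alpha": [0.1, 1.0]}  # illustrative grid, not the exact one used

# One search per target metric, mirroring Pipelines 1-4
searches = {
    metric: GridSearchCV(pipe, param_grid, scoring=metric, cv=2)
    for metric in ("accuracy", "f1", "precision", "roc_auc")
}

# Toy data; 1 = SPAM, 0 = HAM
X = ["free prize win now", "claim your free mobile prize",
     "see you at lunch", "meeting at five today"]
y = [1, 1, 0, 0]
searches["accuracy"].fit(X, y)
print(searches["accuracy"].best_params_)
```

Each fitted search keeps the estimator that scored best on its own metric, which is why the four pipelines end up with different trade-offs.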
  • Confusion matrices for all the pipelines can be seen below -

    • Pipeline 1: Accuracy

    • Pipeline 2: F1-Score

    • Pipeline 3: Precision

    • Pipeline 4: ROC-AUC-Score

  • Performance Report of all the Pipelines -

    Metric          Pipeline 1   Pipeline 2   Pipeline 3   Pipeline 4
    accuracy        0.990117     0.990117     0.977538     0.983827
    f1-score        0.961404     0.961404     0.906367     0.934783
    precision       0.985612     0.985612     1.000000     0.992308
    recall          0.938356     0.938356     0.828767     0.883562
    roc auc score   0.968144     0.968144     0.914384     0.941264
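A report like the table above can be reproduced from a model's predictions with scikit-learn's metric functions. The toy labels below are hypothetical and illustrate the Pipeline 3 situation: one SPAM is missed, but nothing legitimate is flagged, so precision hits 1.0 while recall suffers:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Toy predictions; 1 = SPAM. One SPAM is missed (a false negative),
# but no HAM is flagged as SPAM (zero false positives).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("false positives:", fp)                         # 0
print("precision:", precision_score(y_true, y_pred))  # fp == 0 -> 1.0
print("recall:", round(recall_score(y_true, y_pred), 4))
```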
  • When we aim for better accuracy and for better F1-score, we get exactly the same report. Pipeline 1 and Pipeline 2 therefore behave identically, which is evident from their confusion matrices.

  • When we aimed for better precision, we got zero false positives, but at the cost of the other metrics: accuracy, recall, F1-score, and ROC-AUC score.

  • Pipeline 4 performed reasonably, sitting between Pipelines 1 and 3 on every metric.

  • Their performance can be seen from the graph below -

    (bar chart comparing the four pipelines' scores)

  • Choosing the best pipeline - the dilemma of accuracy vs. precision

    • Why am I stressing precision?
      • Because our goal is to reduce false positives, i.e. SMSs falsely predicted as SPAM; otherwise the user might miss some important SMSs. Maximizing precision minimizes false positives, which is evident from the formula of precision, for which you can refer to this section.
      • Pipeline 1 gives the best accuracy and decent precision.
      • Pipeline 3 gives the best precision, with zero false positives, but hurts the other scores.
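For reference, with TP, FP, and FN denoting true positives, false positives, and false negatives, the two competing metrics are:

```latex
\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}
```

Driving FP to zero pushes precision to 1 exactly, which is what Pipeline 3 achieves, but only by predicting SPAM less often, which raises FN and drags recall down.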
  • Keeping all this in mind, I chose Pipeline 1, which uses a Naive Bayes classifier with the TF-IDF word vectorizer.

  • The final classification report for this pipeline is shown below -

    Metric          Score
    accuracy        99.01
    f1-score        96.14
    precision       98.56
    recall          93.83
    roc auc score   96.81
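As a rough sketch, the chosen pipeline (a TF-IDF vectorizer feeding a Multinomial Naive Bayes classifier) can be assembled like this; the two training messages are toy stand-ins for the Kaggle data, not the real corpus:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

spam_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("clf", MultinomialNB()),
])

# Toy stand-ins for the Kaggle SMS dataset
X = ["WIN a FREE prize, call now!", "Are we still meeting for lunch?"]
y = ["spam", "ham"]
spam_pipeline.fit(X, y)

print(spam_pipeline.predict(["Claim your free mobile prize, txt WIN"])[0])  # spam
```

The Flask app essentially wraps a pipeline like this one: the posted SMS text goes in, and the predicted label comes back.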

    How Can Our Naive Bayes Model Be Fooled?

    I humbly acknowledge that my NB model can be fooled easily: stuff an SMS with lots and lots of "hammy" words (words that generally don't appear in the SPAM messages WordCloud) and the prediction flips. The choice of words can therefore easily fool our model.
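    A toy illustration of this weakness (hypothetical data, and a Bag of Words CountVectorizer rather than the final pipeline's TF-IDF, chosen here for predictability): repeating a couple of "hammy" words is enough to flip a clear SPAM prediction.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny toy corpus: two SPAM and two HAM messages
model = Pipeline([("bow", CountVectorizer()), ("clf", MultinomialNB())])
model.fit(
    ["win free prize", "free prize claim",           # SPAM
     "lunch meeting tomorrow", "see you at lunch"],  # HAM
    ["spam", "spam", "ham", "ham"],
)

print(model.predict(["win free prize"])[0])                              # spam
# Stuffing the same message with "hammy" words flips the prediction
print(model.predict(["win free prize lunch meeting lunch meeting"])[0])  # ham
```

    Each repeated "hammy" word multiplies the HAM likelihood relative to SPAM, so with enough of them the posterior tips over, regardless of the spammy words still present.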

    For more details, you can have a look at my IPython notebook for NLP pipeline building.

    Credits

    • Google Colaboratory for providing the computational power to build and train the models.
    • Heroku Cloud Platform for making the deployment of this project possible.
    • #66daysofdata community created by Ken Jee, which keeps me accountable and consistent in my Data Science journey.