# About this project


Spam messages are unsolicited and unwanted messages. Fraudsters use them to trick people into giving up personal information such as passwords, account numbers, or credit card details.

These messages are designed so that people fall for them: someone with little knowledge of scams will find it difficult to tell whether an SMS comes from a scammer.



In this project, I will build an application that helps determine whether an SMS is spam. The idea is to teach the computer to classify messages as spam or not spam. To do that, I will use the **Multinomial Naive Bayes algorithm** along with a dataset of 5,572 SMS messages that have already been classified by humans.
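
For reference, the decision rule behind Multinomial Naive Bayes can be written as below. This is the standard textbook form rather than something derived in this notebook, and the notation is my own assumption: $w_i$ are the words of a message, $N_{w_i \mid \text{Spam}}$ is how often $w_i$ occurs in spam messages, $N_{\text{Spam}}$ is the total number of words in spam messages, $N_{\text{Vocabulary}}$ is the vocabulary size, and $\alpha$ is a smoothing parameter (often set to 1).

```latex
P(\text{Spam} \mid w_1, \dots, w_n) \propto P(\text{Spam}) \prod_{i=1}^{n} P(w_i \mid \text{Spam})

P(w_i \mid \text{Spam}) = \frac{N_{w_i \mid \text{Spam}} + \alpha}{N_{\text{Spam}} + \alpha \, N_{\text{Vocabulary}}}
```

The same two formulas apply to the ham class, and a message is labelled with whichever class scores higher.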

My goal is to create a spam filter that classifies new messages with an accuracy greater than 80%, so I expect more than 80% of new messages to be classified correctly as spam or ham (non-spam).

This is a machine learning classification problem.


In [1]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
import re
from sklearn.feature_extraction import DictVectorizer 


### Exploratory Data Analysis (EDA)

In [2]:
# Read the SMS data: tab-separated, no header row, so column names are supplied
data = pd.read_csv("SMSSpamCollection", sep="\t", header=None, names=["Label", "SMS"])
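
To show what these `read_csv` arguments do, here is a minimal sketch on two made-up lines laid out like the real file (label, a tab, then the message). The sample text and the names `sample` and `toy` are hypothetical and only serve the illustration.

```python
import io

import pandas as pd

# Two made-up lines in the same "label<TAB>message" layout as the real file.
sample = "ham\tSee you at lunch\nspam\tWin a free prize now\n"

# header=None stops pandas from treating the first message as column names;
# names= supplies the column labels instead.
toy = pd.read_csv(io.StringIO(sample), sep="\t", header=None, names=["Label", "SMS"])
print(toy.shape)
```

Without `header=None`, the first message would be consumed as the header and the data would be one row short.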

In [3]:
data.shape

(5572, 2)

In [4]:
data.head()

| | Label | SMS |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [5]:
(data["Label"]=="ham").value_counts(normalize=True)

True     0.865937
False    0.134063
Name: Label, dtype: float64

### Observation: 
- The dataset has two columns, `Label` and `SMS`.
- The `Label` column has two unique values: `ham` (not spam) and `spam`.
- The `SMS` column contains the message text; each message is labelled in the `Label` column.
- The data has 5,572 rows.
- Almost 87% of the SMS messages are classified as non-spam (ham) and the remaining 13% as spam.
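
Because the classes are imbalanced, it helps to know what accuracy a trivial model would reach. The sketch below is my own illustration on made-up labels, not the real dataset: a model that always answers ham is right for every ham message, so with about 87% ham it scores about 87% accuracy. The names `labels`, `predictions` and `baseline_accuracy` are hypothetical.

```python
import pandas as pd

# Toy labels with an imbalance similar to the real data (values are made up).
labels = pd.Series(["ham"] * 87 + ["spam"] * 13)

# A "classifier" that ignores the message and always answers "ham".
predictions = pd.Series(["ham"] * len(labels))

# Accuracy = correctly classified messages / all messages.
baseline_accuracy = (predictions == labels).mean()
print(baseline_accuracy)
```

On this dataset the always-ham baseline would already sit above the 80% target, so the filter is only useful if it also beats that baseline.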

In [6]:
# Shuffle the rows; random_state makes the order reproducible
randomised_data = data.sample(frac=1, random_state=1)
randomised_data

| | Label | SMS |
|---|---|---|
| 1078 | ham | Yep, by the pretty sculpture |
| 4028 | ham | Yes, princess. Are you going to make me moan? |
| 958 | ham | Welp apparently he retired |
| 4642 | ham | Havent. |
| 4674 | ham | I forgot 2 ask ü all smth.. There's a card on ... |
| ... | ... | ... |
| 905 | ham | We're all getting worried over here, derek and... |
| 5192 | ham | Oh oh... Den muz change plan liao... Go back h... |
| 3980 | ham | CERI U REBEL! SWEET DREAMZ ME LITTLE BUDDY!! C... |
| 235 | spam | Text & meet someone sexy today. U can find a d... |


In [7]:
# Encode the labels as integers: spam -> 1, ham -> 0
randomised_data["Label"] = (randomised_data["Label"] == "spam").astype(int)

In [8]:
# Hold out 20% of the messages as a test set
data_train, data_test = train_test_split(randomised_data, test_size=0.2, random_state=1)

In [9]:
data_train=data_train.reset_index(drop=True)
data_train.shape

(4457, 2)

In [10]:
data_test=data_test.reset_index(drop=True)
data_test.shape

(1115, 2)

In [21]:
data_test.Label.value_counts(normalize=True)

0    0.861883
1    0.138117
Name: Label, dtype: float64

In [22]:
data_train.Label.value_counts(normalize=True)

0    0.866951
1    0.133049
Name: Label, dtype: float64

#### Observation: The train and test sets keep roughly the same class balance as the full dataset: about 86-87% of the SMS messages are non-spam (ham) and 13-14% are spam.
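
The split used here relies on shuffling alone, which happens to give similar proportions in both sets. If an exact match is wanted, `train_test_split` accepts a `stratify` argument. The sketch below is an optional variation on toy data, not what this notebook does; `toy`, `toy_train` and `toy_test` are hypothetical names and the messages are placeholders.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 80 ham (0) and 20 spam (1); the messages are placeholders.
toy = pd.DataFrame({
    "Label": [0] * 80 + [1] * 20,
    "SMS": [f"message {i}" for i in range(100)],
})

# stratify= keeps the ham/spam ratio the same in both splits.
toy_train, toy_test = train_test_split(
    toy, test_size=0.2, random_state=1, stratify=toy["Label"]
)

# Share of spam in each split.
print(toy_train["Label"].mean(), toy_test["Label"].mean())
```

With stratification both splits keep the 20% spam share of the toy data, whatever `random_state` is used.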

In [13]:
data_train.head()

| | Label | SMS |
|---|---|---|
| 0 | 1 | URGENT! We are trying to contact U. Todays dra... |
| 1 | 0 | 1 I don't have her number and 2 its gonna be a... |
| 2 | 0 | Party's at my place at usf, no charge (but if ... |
| 3 | 0 | Mm not entirely sure i understood that text bu... |
| 4 | 0 | Yes we are chatting too. |


In [14]:
data_test.head()

| | Label | SMS |
|---|---|---|
| 0 | 0 | Good night my dear.. Sleepwell&amp;Take care |
| 1 | 0 | Sen told that he is going to join his uncle fi... |
| 2 | 0 | Thank you baby! I cant wait to taste the real ... |
| 3 | 0 | When can ü come out? |
| 4 | 0 | No. Thank you. You've been wonderful |


In [15]:
# Remove punctuation from the SMS text: replace every non-word character with a space
data_train["SMS"] = data_train["SMS"].replace(r"\W", " ", regex=True)
data_test["SMS"] = data_test["SMS"].replace(r"\W", " ", regex=True)

In [16]:
data_train.head()

| | Label | SMS |
|---|---|---|
| 0 | 1 | URGENT We are trying to contact U Todays dra... |
| 1 | 0 | 1 I don t have her number and 2 its gonna be a... |
| 2 | 0 | Party s at my place at usf no charge but if ... |
| 3 | 0 | Mm not entirely sure i understood that text bu... |
| 4 | 0 | Yes we are chatting too |


In [25]:
data_test.head()

| | Label | SMS |
|---|---|---|
| 0 | 0 | Good night my dear Sleepwell amp Take care |
| 1 | 0 | Sen told that he is going to join his uncle fi... |
| 2 | 0 | Thank you baby I cant wait to taste the real ... |
| 3 | 0 | When can ü come out |
| 4 | 0 | No Thank you You ve been wonderful |
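
The pattern `\W` matches any character that is not a letter, digit, or underscore, which is why letters such as `ü` survive the cleaning while apostrophes and full stops become spaces. The sketch below applies the same substitution to a single made-up message with the standard `re` module. The lower-casing and splitting at the end are a possible next step that I am assuming here, not something done in the cells above.

```python
import re

message = "Don't miss out!! Call 0800-123 NOW"  # made-up example

# Same idea as the cleaning cell: every non-word character becomes a space.
no_punct = re.sub(r"\W", " ", message)

# Possible follow-up (an assumption, not done yet in this notebook):
# lower-case the text and split on whitespace to get a list of words.
words = no_punct.lower().split()
print(words)
```

Note that contractions are split in two ("Don't" becomes "don" and "t"), which matches the `I don t have her number` row in the cleaned training data.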
