# About this project


Spam messages are unsolicited, unwanted messages. Fraudsters use them to trick people into giving up personal information such as passwords, account numbers, or credit card details.

These messages are designed so that people fall for them, because it is difficult for anyone with little knowledge of scams to tell whether an SMS comes from a scammer.



In this project, I will build an application that helps determine whether an SMS is spam. The idea is to teach the computer to classify SMS messages as spam or not spam. To do that, I will use the **Multinomial Naive Bayes algorithm** along with a dataset of 5,572 SMS messages that have already been classified by humans.

My goal is to create a spam filter that classifies new messages with an accuracy greater than 80%, so I expect more than 80% of new messages to be classified correctly as spam or ham (non-spam).

This is a machine learning classification problem.


In [44]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
import re
from sklearn.feature_extraction import DictVectorizer 
from sklearn.naive_bayes import MultinomialNB


### Exploratory Data Analysis (EDA)

In [2]:
#read the sms data
data=pd.read_csv("SMSSpamCollection",sep='\t',header=None,names=['Label', 'SMS'])

In [3]:
data.shape

(5572, 2)

In [4]:
data.head()

Unnamed: 0,Label,SMS
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [5]:
(data["Label"]=="ham").value_counts(normalize=True)

True     0.865937
False    0.134063
Name: Label, dtype: float64

### Observation:
- The data set has two columns, Label and SMS.
- The Label column has two unique values: ham (not spam) and spam.
- The SMS column contains the message text. Each message is labelled in the Label column.
- The data has 5,572 rows.
- Almost 87% of the SMS messages are classified as non-spam (ham) and the remaining 13% as spam.
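
The class balance matters for the accuracy goal: a classifier that always answers "ham" would already be right about 87% of the time on data like this. A minimal sketch of that majority-class baseline follows. The `labels` series is made up to mimic the imbalance and is not the real dataset.

```python
import pandas as pd

# Hypothetical labels with roughly the same imbalance as the SMS dataset
labels = pd.Series(["ham"] * 87 + ["spam"] * 13, name="Label")

# Share of each class
proportions = labels.value_counts(normalize=True)

# A model that always predicts the most frequent class reaches this accuracy
majority_class = proportions.idxmax()
baseline_accuracy = proportions.max()

print(majority_class, baseline_accuracy)
```

Any trained model should be compared against this baseline rather than against the 80% target alone.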

In [6]:
## Randomise the dataset
randomised_data=data.sample(frac=1,random_state=1)
randomised_data

Unnamed: 0,Label,SMS
1078,ham,"Yep, by the pretty sculpture"
4028,ham,"Yes, princess. Are you going to make me moan?"
958,ham,Welp apparently he retired
4642,ham,Havent.
4674,ham,I forgot 2 ask ü all smth.. There's a card on ...
...,...,...
905,ham,"We're all getting worried over here, derek and..."
5192,ham,Oh oh... Den muz change plan liao... Go back h...
3980,ham,CERI U REBEL! SWEET DREAMZ ME LITTLE BUDDY!! C...
235,spam,Text & meet someone sexy today. U can find a d...


In [7]:
# Convert the target (Label) to a numeric value: 1 = spam, 0 = ham
randomised_data.Label=(randomised_data.Label=="spam").astype(int)

In [8]:
# Hold out 20% of the rows as a test set
data_train,data_test=train_test_split(randomised_data,test_size=0.2,random_state=1)

In [23]:
y_train=data_train["Label"]
y_test=data_test["Label"]

In [29]:
del data_train["Label"]
del data_test["Label"]

In [30]:
data_train=data_train.reset_index(drop=True)
data_train.shape

(4457, 1)

In [31]:
data_test=data_test.reset_index(drop=True)
data_test.shape

(1115, 1)

#### Observation: Both the train and test data have about 87% of the SMS messages classified as non-spam (ham) and about 13% classified as spam.
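
One way to check an observation like this is to compute the spam share in each split. The sketch below uses a made-up `toy` frame rather than the real data. It also passes `stratify`, an optional `train_test_split` argument that this notebook does not use, to show how the class ratio can be kept nearly identical in both splits.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the labelled data: 1 = spam, 0 = ham
toy = pd.DataFrame({
    "SMS": [f"message {i}" for i in range(100)],
    "Label": [1] * 13 + [0] * 87,
})

# stratify keeps the class ratio (almost) the same in both splits
toy_train, toy_test = train_test_split(
    toy, test_size=0.2, random_state=1, stratify=toy["Label"]
)

# With 0/1 labels, the mean is the share of spam messages
train_spam_share = toy_train["Label"].mean()
test_spam_share = toy_test["Label"].mean()
print(train_spam_share, test_spam_share)
```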

In [34]:
data_train.head()

Unnamed: 0,SMS
0,urgent we are trying to contact u todays dra...
1,1 i don t have her number and 2 its gonna be a...
2,party s at my place at usf no charge but if ...
3,mm not entirely sure i understood that text bu...
4,yes we are chatting too


In [35]:
data_test.head()

Unnamed: 0,SMS
0,good night my dear sleepwell amp take care
1,sen told that he is going to join his uncle fi...
2,thank you baby i cant wait to taste the real ...
3,when can ü come out
4,no thank you you ve been wonderful


In [36]:
## Remove punctuation from the SMS text (replace every non-word character with a space)
data_train["SMS"]=data_train["SMS"].replace(r"\W", " ", regex=True)
data_test["SMS"]=data_test["SMS"].replace(r"\W", " ", regex=True)

In [37]:
# Convert all letters to lower case
data_train["SMS"]=data_train["SMS"].str.lower()
data_test["SMS"]=data_test["SMS"].str.lower()
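
The two cleaning steps above can also be written as one plain function, which makes it easier to apply exactly the same cleaning to a single new message later. This is a minimal sketch; the name `clean_sms` is hypothetical and is not defined elsewhere in this notebook.

```python
import re

def clean_sms(text: str) -> str:
    """Replace every non-word character with a space, then lower-case."""
    return re.sub(r"\W", " ", text).lower()

# Punctuation becomes spaces, so words stay separated
print(clean_sms("Hi, there!"))
print(clean_sms("WINNER!! Call 0905-000 now.").split())
```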

In [38]:
data_train.head()

Unnamed: 0,SMS
0,urgent we are trying to contact u todays dra...
1,1 i don t have her number and 2 its gonna be a...
2,party s at my place at usf no charge but if ...
3,mm not entirely sure i understood that text bu...
4,yes we are chatting too


In [39]:
data_test.head()

Unnamed: 0,SMS
0,good night my dear sleepwell amp take care
1,sen told that he is going to join his uncle fi...
2,thank you baby i cant wait to taste the real ...
3,when can ü come out
4,no thank you you ve been wonderful


In [41]:
# Convert each training row to a dict so it can be fed to DictVectorizer
train_dicts=data_train.to_dict(orient='records')
train_dicts[0]


{'SMS': 'urgent  we are trying to contact u  todays draw shows that you have won a  800 prize guaranteed  call 09050001295 from land line  claim a21  valid 12hrs only'}

In [47]:
# Convert each test row to a dict so it can be fed to DictVectorizer
test_dicts=data_test.to_dict(orient='records')
test_dicts[0]


{'SMS': 'good night my dear   sleepwell amp take care'}

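In [42]:
# Fit the vectorizer on the training data only, then encode the training set
dv=DictVectorizer(sparse=False)
dv.fit(train_dicts)
X_train=dv.transform(train_dicts)

In [48]:
# Encode the test set with the vectorizer fitted on the training data
X_test=dv.transform(test_dicts)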

In [43]:
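# Note: get_feature_names() was removed in scikit-learn 1.2; newer versions use get_feature_names_out()
dv.get_feature_names()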



['SMS=   ',
 'SMS=       ',
 'SMS=    are you in the pub ',
 'SMS=    oh well  c u later',
 'SMS=    ok  i feel like john lennon ',
 'SMS=    photoshop makes my computer shut down ',
 'SMS=    that s not v romantic ',
 'SMS=    yeah  lol  luckily i didn t have a starring role like you ',
 'SMS=   but your not here    ',
 'SMS=  am on a train back from northampton so i m afraid not ',
 'SMS=  am on my way',
 'SMS=  and don t worry we ll have finished by march   ish ',
 'SMS=  free message  thanks for using the auction subscription service  18   150p msgrcvd 2 skip an auction txt out  2 unsubscribe txt stop customercare 08718726270',
 'SMS=  how s things  just a quick question ',
 'SMS=  im    on the snowboarding trip  i was wondering if your planning to get everyone together befor we go  a meet and greet kind of affair  cheers  ',
 'SMS=  lt   gt   in mca  but not conform ',
 'SMS=  lt   gt   mins but i had to stop somewhere first ',
 'SMS=  lt decimal gt  m but its not a common car her

In [45]:
X_train

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])
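
Each feature name listed above contains an entire message, which suggests that `DictVectorizer` is treating every SMS as one categorical value rather than as a collection of words. Under that encoding, a message that never appeared in the training data would be encoded as a row of zeros. A common alternative, shown here only as a sketch and not as this notebook's method, is `CountVectorizer` from `sklearn.feature_extraction.text`, which splits each message into words and counts them. The toy messages are made up.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy messages, already cleaned and lower-cased
toy_train = [
    "free prize call now",
    "call me when you are free",
    "see you at lunch",
]

cv = CountVectorizer()
X_toy = cv.fit_transform(toy_train)  # sparse matrix: one row per message, one column per word

# The learned vocabulary maps each word to a column index
vocab = sorted(cv.vocabulary_)
print(vocab)

# An unseen message still shares individual words with the training data
X_new = cv.transform(["free lunch now"])
print(X_new.toarray())
```

With word-level features, Multinomial Naive Bayes can learn which words are more frequent in spam than in ham, which is the quantity the algorithm is built around.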

In [73]:
nb_model = MultinomialNB()

# Train the model
nb_model.fit(X_train, y_train)

# Predicted probability of the spam class for each test message
y_pred = nb_model.predict_proba(X_test)[:,1]

# Evaluate the model: share of test messages classified correctly
accuracy = nb_model.score(X_test, y_test)
print("Accuracy:", accuracy)

[False False False ... False False False]
Accuracy: 0.862780269058296


In [74]:
max(y_pred)

0.5084095678235732
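
An accuracy of about 0.86 is close to the 87% share of ham messages, and the highest spam probability is only about 0.51, so it is worth checking how many spam messages are actually caught. Accuracy alone can hide this on imbalanced data. The sketch below uses made-up labels and probabilities (`y_true` and `y_prob` are illustrative, not this notebook's results) together with `confusion_matrix` and `recall_score` from `sklearn.metrics`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical true labels and predicted spam probabilities (1 = spam, 0 = ham)
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 0])
y_prob = np.array([0.13, 0.13, 0.13, 0.13, 0.13, 0.13, 0.13, 0.13, 0.51, 0.13])

# Turn probabilities into hard labels with a 0.5 threshold
y_hat = (y_prob >= 0.5).astype(int)

toy_accuracy = (y_hat == y_true).mean()     # share of correct predictions
spam_recall = recall_score(y_true, y_hat)   # share of true spam that was caught
cm = confusion_matrix(y_true, y_hat)        # rows: true class, columns: predicted class

print(toy_accuracy, spam_recall)
print(cm)
```

In this toy case the accuracy looks high even though half of the spam messages are missed, which is why recall on the spam class is a useful companion metric.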

In [None]:
'WINNER!! This is the secret code to unlock the money: C3421.'


In [None]:
"Sounds good, Tom, then see u there"


In [82]:
# Wrap the new SMS message in the dict format the vectorizer expects
new_sms = "WINNER!! This is the secret code to unlock the money: C3421."
new_sms_dict = {'SMS': new_sms}
new_sms_encoded = dv.transform([new_sms_dict])

# Predict the probability that the message is spam
prediction = nb_model.predict_proba(new_sms_encoded)[:,1]
print(prediction)

# Report the class using a 0.5 threshold
if prediction[0] >= 0.5:
    print("The SMS is classified as spam.")
else:
    print("The SMS is classified as non-spam.")


[0.13304914]
The SMS is classified as non-spam.
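
The training messages were cleaned (punctuation removed, lower-cased) before encoding, so a new message generally needs the same cleaning before it is passed to the model. One way to guarantee this is to put the cleaning, the vectorizer, and the classifier into a single scikit-learn pipeline. The sketch below is an assumption about how that could look, not this notebook's implementation: `clean_sms`, `classify_sms`, and the four toy messages are hypothetical, and it uses word counts from `CountVectorizer` instead of `DictVectorizer`.

```python
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def clean_sms(text):
    """Same cleaning as used for training: strip punctuation, lower-case."""
    return re.sub(r"\W", " ", text).lower()

# Hypothetical toy training data: 1 = spam, 0 = ham
toy_sms = [
    "WINNER! Claim your free prize now",
    "Free entry, win cash prize, call now",
    "See you at lunch then",
    "Ok, see you there tomorrow",
]
toy_labels = [1, 1, 0, 0]

# The pipeline applies clean_sms to training and new messages alike
model = make_pipeline(CountVectorizer(preprocessor=clean_sms), MultinomialNB())
model.fit(toy_sms, toy_labels)

def classify_sms(text, threshold=0.5):
    """Return 'spam' or 'non-spam' for a raw, uncleaned message."""
    spam_prob = model.predict_proba([text])[0, 1]
    return "spam" if spam_prob >= threshold else "non-spam"

print(classify_sms("You are a WINNER, claim the prize"))
print(classify_sms("See you there"))
```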


In [79]:
new_sms

'Sounds good, Tom, then see u there'

In [None]:
def 