# <font color='red'> Spam SMS Classifier<font>
### Machine learning is so powerful that it is helping us in each possible area and saving precious time. One of the use cases of Machine learning is Spam SMS detection, Using Natural Language Processing we can easily classify Spam and non Spam SMS.
# <font color='green'> Download data for spam detection<font>
## https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection
# <font color='red'> Data Description : <font> 
### The collection is composed by just one text file, where each line has the correct class followed by the raw message. 
### Ham : Not spam and Spam : Spam
 A subset of 5572 SMS randomly chosen ham messages of the NUS SMS Corpus (NSC), which is a dataset of about 10,000 legitimate messages collected for research at the Department of Computer Science at the National University of Singapore. The messages largely originate from Singaporeans and mostly from students attending the University. These messages were collected from volunteers who were made aware that their contributions were going to be made publicly available.
# <font color='red'> Objective : <font> 
### Using machine learning model, our model should be able to classify Spam and normal SMS.   

In [1]:
# Importing Libraries
import pandas as pd
import numpy as np
import nltk
import re
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

In [2]:
sms_data = pd.read_csv('smsspamcollection/SMSSpamCollection',sep='\t', names=['label','message'])
sms_data.head(5)

Unnamed: 0,label,message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


ham means not spam and spam is spam SMS

In [3]:
# Getting Size of data
sms_data.shape
print("We have 5572 SMS")

We have 5572 SMS


In [4]:
# Check if data is balanced or not
sms_data['label'].value_counts()

ham     4825
spam     747
Name: label, dtype: int64

<font color='red'> This dataset is imbalanced , because number of sample of class ham is significantly larger than number of sample of class spam. But we have 747 spam, we will be able to classify SMS.<font>

In [5]:
# making set of stopwords (Those words which do not play significant role in decision making of text processing)
stopwords = set(stopwords.words('english'))
print(stopwords)
print("There are ",len(stopwords)," in english")

{'down', 'aren', 'this', 'nor', 'mustn', 'under', 'shan', 'have', 'when', "isn't", "didn't", 'of', 'wasn', 'your', 'is', 'once', 'yourselves', 'can', 'he', 'out', 'isn', "mightn't", 'other', "aren't", 'had', 'how', 'being', 'but', "it's", 'was', 'above', 'mightn', 'into', 'now', 't', 'to', 'in', 'does', 'haven', "shouldn't", 'them', 'a', 'o', 'ourselves', 'hers', "couldn't", 'hasn', 'while', 'which', 'itself', 'him', 'such', 'be', 'yours', 's', 'if', 'whom', 'than', 'over', 'theirs', 'and', 'here', 'are', 'she', 'then', 'been', 'me', 'off', 'that', 'hadn', 'couldn', 'didn', 'after', 'll', 'further', 'm', 'myself', 'his', 've', 'no', 'it', 'as', 'from', 'we', 'few', 'because', 'my', 'ours', 'their', 'against', 'before', "don't", 'who', 'y', "wasn't", 'they', "you've", 'between', 'wouldn', "weren't", 'an', 'has', 'there', "that'll", 'below', 'not', "won't", 'both', "you'd", 'during', 'very', "hasn't", 'am', 'the', 'do', 'at', 'don', "wouldn't", 'i', 'these', 'shouldn', "hadn't", 'any', '

# <font color='green'> Text Preprocessing <font>
<ul>
  <li>Stemming : processing of removing suffix from words to bring into root form, ex: going,gone,goes changes to root word go.</li>
  <li>Removing punctuations, take only alphabates.</li>
  <li>Convert words into lowercase.</li>
    <li>Remove stopwords because they don't play significant role.</li>
    <li>Make coupus : list of documents</li>
</ul>

In [6]:
# intialize stemmer object
stemmer = PorterStemmer()
corpus = []
for i in range(len(sms_data)):
    # replace non-alphabates with space
    row_data = re.sub('[^a-zA-Z]'," ",sms_data['message'][i])
    # convert words into lowercase
    row_data = row_data.lower()
    # make list from sentences
    row_data_list = row_data.split()
    # remove stopwords
    important_row_data = [stemmer.stem(word) for word in row_data_list if word not in stopwords]
    data = ''.join(important_row_data)
    # append to corpus
    corpus.append(data)

# <font color='green'> Using "Bag of Words"  method <font>
<ul>
  <li>After text preprocessing , make differnt dimensions for differnt each words in corpus</li>
  <li>If unique words in corpus is d and there are n documents,make a matrix of nxd.</li>
  <li>Each sentence will  be represented by a d-dimension vectors.</li>
    <li>Matrix will be sparse matrix</li>
</ul>
    

In [7]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
X = cv.fit_transform(corpus).toarray()
print(X.shape)

(5572, 5053)


There are 5053 different words,but many of them have very less frequency , lets take only top 3000 most frequent words

In [25]:
# let's take only top 2500 most frequent words, it will reduce time complexit
cv = CountVectorizer()
X = cv.fit_transform(corpus).toarray()
print("Size of X : ",X.shape)
X[0:10]

Size of X :  (5572, 5053)


array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

## label have to classes , ham and spam. but machine should be given numeric value,let's convert it into numeric form, 0 ham and 1 for spam

In [33]:
y = pd.get_dummies(sms_data['label'],drop_first=True)
print("size of y : ",y.shape)
print("Converted label :")
print(y[:5])
print('Original label :')
print(sms_data['label'][:5])

size of y :  (5572, 1)
Converted label :
   spam
0     0
1     0
2     1
3     0
4     0
Original label :
0     ham
1     ham
2    spam
3     ham
4     ham
Name: label, dtype: object


# <font color='green'> Splitting data into train and test set <font>

In [50]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.33,random_state=0)

# <font color='green'> Using Naive Bayes classifier <font>

In [51]:
from sklearn.naive_bayes import MultinomialNB
spam_model = MultinomialNB().fit(X_train,y_train)

  y = column_or_1d(y, warn=True)


# <font color='green'> Prediction <font>

In [52]:
y_pred = spam_model.predict(X_test)

# <font color='green'> Analyzing model <font>

In [53]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test,y_pred)
print(cm)

[[1597    0]
 [ 236    6]]


In [54]:
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test,y_pred)
print("Model achieved ",round(accuracy*100,3) ,"% accuracy.")

Model achieved  87.167 % accuracy.


In [55]:
from sklearn.metrics import classification_report
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.87      1.00      0.93      1597
           1       1.00      0.02      0.05       242

    accuracy                           0.87      1839
   macro avg       0.94      0.51      0.49      1839
weighted avg       0.89      0.87      0.82      1839

