# SMS Spam Classification
---
### Natural Language Processing Using Machine Learning

In this project, we compare two classifiers:
* Multinomial Naive Bayes Classifier
* Support Vector Machine

Both are trained to predict whether an SMS message is spam.
___

## Import Necessary Libraries
---

In [1]:
# Importing libraries
import pandas as pd
import numpy as np
import string

# For text cleaning
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.svm import LinearSVC

---
## Read Data
---

In [2]:
# Read in data
df = pd.read_csv('SMS_Spam_Collection', sep='\t', names=['Label', 'Message'])
df

Unnamed: 0,Label,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...
5568,ham,Will ü b going to esplanade fr home?
5569,ham,"Pity, * was in mood for that. So...any other s..."
5570,ham,The guy did some bitching but I acted like i'd...


---
## Exploratory Data Analysis
---

In [3]:
df.shape

(5572, 2)

In [4]:
df['Label'].value_counts()

ham     4825
spam     747
Name: Label, dtype: int64

In [5]:
df['Label'].value_counts(normalize=True)

ham     0.865937
spam    0.134063
Name: Label, dtype: float64

---
We see that 4825 out of 5572 messages, or 86.6%, are ham messages. A model that always predicts *ham* would therefore already be 86.6% accurate, so any machine learning model we create has to perform **better than 86.6%** to beat this majority-class baseline.
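
To make the baseline concrete, here is a minimal sketch that computes the accuracy of always predicting the most frequent label, using the class counts printed above. The variable names are illustrative and not part of the original notebook.

```python
from collections import Counter

# Class counts copied from the value_counts() output above
label_counts = Counter({"ham": 4825, "spam": 747})

# A "model" that always predicts the most frequent label
majority_label, majority_count = label_counts.most_common(1)[0]
baseline_accuracy = majority_count / sum(label_counts.values())

print(majority_label, round(baseline_accuracy, 4))  # ham 0.8659
```

Such a baseline never flags a single spam message, which is why accuracy alone is not enough to judge the classifiers built later.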
___

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Label    5572 non-null   object
 1   Message  5572 non-null   object
dtypes: object(2)
memory usage: 87.2+ KB


In [7]:
df.isna().sum()

Label      0
Message    0
dtype: int64

---
## Data Processing 
---
Here, we will calculate the length (in characters) and the number of punctuation characters in each message for further analysis.

In [8]:
# Create new column Length which stores the number of characters in each message
df['Length'] = df['Message'].str.len()
df

Unnamed: 0,Label,Message,Length
0,ham,"Go until jurong point, crazy.. Available only ...",111
1,ham,Ok lar... Joking wif u oni...,29
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,155
3,ham,U dun say so early hor... U c already then say...,49
4,ham,"Nah I don't think he goes to usf, he lives aro...",61
...,...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...,160
5568,ham,Will ü b going to esplanade fr home?,36
5569,ham,"Pity, * was in mood for that. So...any other s...",57
5570,ham,The guy did some bitching but I acted like i'd...,125


In [9]:
# Count the punctuation characters in a message
def count_punctuations(message):
    return sum(char in string.punctuation for char in message)

# Create new column Punctuations which stores the number of punctuation characters in each message
df['Punctuations'] = df['Message'].apply(count_punctuations)
df

Unnamed: 0,Label,Message,Length,Punctuations
0,ham,"Go until jurong point, crazy.. Available only ...",111,9
1,ham,Ok lar... Joking wif u oni...,29,6
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,155,6
3,ham,U dun say so early hor... U c already then say...,49,6
4,ham,"Nah I don't think he goes to usf, he lives aro...",61,2
...,...,...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...,160,8
5568,ham,Will ü b going to esplanade fr home?,36,1
5569,ham,"Pity, * was in mood for that. So...any other s...",57,7
5570,ham,The guy did some bitching but I acted like i'd...,125,1


---
## Text Cleaning
---
Messages contain punctuation, special characters, inflected word forms, and many stop words (common words such as *the* or *is* that add little meaning to a sentence and can usually be dropped without changing what it says).

We will now clean the messages by removing this noise.

In [10]:
# Create the lemmatizer and load the English stop words
# (requires the NLTK 'stopwords' and 'wordnet' corpora to be downloaded)
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

# Create function to clean message
def clean(message):
    message = re.sub(r'[^a-zA-Z0-9]', ' ', message) # replace anything that is not a letter or digit with a space
    message = message.lower() # convert to lower case
    
    # Remove stop words and lemmatize the remaining words using a list comprehension
    message = message.split()
    message = [lemmatizer.lemmatize(word) for word in message if word not in stop_words]
    message = ' '.join(message) # join words back into a message
    return message

df['Message_cleaned'] = df['Message'].apply(clean)
df

Unnamed: 0,Label,Message,Length,Punctuations,Message_cleaned
0,ham,"Go until jurong point, crazy.. Available only ...",111,9,go jurong point crazy available bugis n great ...
1,ham,Ok lar... Joking wif u oni...,29,6,ok lar joking wif u oni
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,155,6,free entry 2 wkly comp win fa cup final tkts 2...
3,ham,U dun say so early hor... U c already then say...,49,6,u dun say early hor u c already say
4,ham,"Nah I don't think he goes to usf, he lives aro...",61,2,nah think go usf life around though
...,...,...,...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...,160,8,2nd time tried 2 contact u u 750 pound prize 2...
5568,ham,Will ü b going to esplanade fr home?,36,1,b going esplanade fr home
5569,ham,"Pity, * was in mood for that. So...any other s...",57,7,pity mood suggestion
5570,ham,The guy did some bitching but I acted like i'd...,125,1,guy bitching acted like interested buying some...


---
Let's take a look at the first cleaned message in our data.
___

In [11]:
df.loc[0, 'Message_cleaned']

'go jurong point crazy available bugis n great world la e buffet cine got amore wat'

---
## Analyzing Differences between Spam and Ham messages

In [12]:
spam = df[df['Label'] == 'spam']
ham = df[df['Label'] == 'ham']
print(f'Shape of spam dataframe: {spam.shape}')
print(f'Shape of ham dataframe: {ham.shape}')

Shape of spam dataframe: (747, 5)
Shape of ham dataframe: (4825, 5)


In [13]:
# Let's compare the length of spam and ham messages
print('Average length of spam messages: {:.2f}'.format(spam['Length'].mean()))
print('Average length of ham messages: {:.2f}'.format(ham['Length'].mean()))

Average length of spam messages: 138.67
Average length of ham messages: 71.48


We can see that spam messages are, on average, almost twice as long (in characters) as ham messages.
___

In [14]:
# Let's compare the number of punctuations in spam and ham messages
print('Average number of punctuations in spam messages: {:.2f}'.format(spam['Punctuations'].mean()))
print('Average number of punctuations in ham messages: {:.2f}'.format(ham['Punctuations'].mean()))

Average number of punctuations in spam messages: 5.71
Average number of punctuations in ham messages: 3.94


Similarly, spam messages contain more punctuation characters on average than ham messages.
___
## Model Building
___
We will split the dataframe into the independent variable (the cleaned messages) and the dependent, or target, variable (the label, spam or ham).

In [15]:
X = df['Message_cleaned']
y = df['Label']

---
### Train Test Split

In [16]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

---
## Dealing with Text (Natural Language) Data
---
Let's talk about how we should deal with the text data. We can’t feed raw text directly into a machine learning model, because these models only operate on numeric data.

To solve this problem, we will use the <u>**TF-IDF Vectorizer**</u> (Term Frequency-Inverse Document Frequency). It is a standard technique for turning text into a numeric representation that a machine learning algorithm can be fitted on.

We could also use a Count Vectorizer (Bag of Words), but it only stores raw word counts, whereas TF-IDF additionally down-weights words that appear in many messages.

We can use the TF-IDF vectorizer from the scikit-learn library. Next, we create a TF-IDF vectorizer object and call fit_transform on the training data, which converts it into a matrix with one row per message and one column per word in the vocabulary.
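
As a rough illustration of what the vectorizer computes, the sketch below builds TF-IDF weights by hand for a hypothetical three-message corpus. It assumes the smoothed IDF formula, idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each row, which is how scikit-learn documents the default behaviour of TfidfVectorizer. Tokenization here is a plain whitespace split, which is simpler than the library's default tokenizer, and the corpus and helper names are made up for the example.

```python
import math
from collections import Counter

# Hypothetical, already-cleaned toy corpus (not taken from the SMS dataset)
docs = ["free prize call now", "call me now", "free free prize"]
tokenized = [doc.split() for doc in docs]
vocab = sorted({word for tokens in tokenized for word in tokens})
n_docs = len(docs)

# Document frequency: in how many messages does each word appear?
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

# Smoothed inverse document frequency (assumed to match scikit-learn's default)
idf = {word: math.log((1 + n_docs) / (1 + doc_freq[word])) + 1 for word in vocab}

def tfidf_row(tokens):
    """Return the L2-normalized TF-IDF vector of one message over `vocab`."""
    counts = Counter(tokens)
    raw = [counts[word] * idf[word] for word in vocab]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm if norm else 0.0 for value in raw]

# One row per message, one column per vocabulary word
matrix = [tfidf_row(tokens) for tokens in tokenized]

print(vocab)
for doc, row in zip(docs, matrix):
    print(doc, [round(value, 2) for value in row])
```

In the second message every word occurs once, yet "me" receives the largest weight because it appears in only one message of the corpus. A plain count vectorizer would assign all three words the same value.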
___

In [17]:
tfidf = TfidfVectorizer()
X_train_tfidf_vect = tfidf.fit_transform(X_train).toarray()

In [18]:
X_train_tfidf_vect

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [19]:
X_train_tfidf_vect.shape

(3733, 6530)

---
## Pipelining
---
We use a Pipeline because the test data must go through exactly the same steps as the training data before we can get predictions, and repeating those steps by hand is tedious and error-prone.

The convenient thing about a pipeline object is that it performs all these steps for you in a single call, so you can pass it the text directly. It will both vectorize the data and run the classifier in one step.

**Note:** When we predict custom text later, we can pass it straight to the Pipeline, which will vectorize it and predict the label.

If you don’t know about the Pipeline, it takes a list of tuples, where each tuple pairs a name of your choice with the transformer or model to run at that step.
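
The following sketch, on a hypothetical four-message training set, compares the manual route (vectorize, then classify) with the pipeline route. The toy texts and variable names are assumptions made for illustration; only the scikit-learn classes already imported in this notebook are used.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data, not taken from the SMS dataset
train_texts = ["win free prize now", "free cash claim now",
               "see you at lunch", "call me after class"]
train_labels = ["spam", "spam", "ham", "ham"]
new_texts = ["claim free prize", "lunch after class"]

# Manual route: fit the vectorizer, fit the classifier,
# then remember to transform new text with the SAME fitted vectorizer
vectorizer = TfidfVectorizer()
X_train_vec = vectorizer.fit_transform(train_texts)
clf = MultinomialNB().fit(X_train_vec, train_labels)
manual_preds = clf.predict(vectorizer.transform(new_texts))

# Pipeline route: one object handles both steps in order
pipe = Pipeline([("tfidf", TfidfVectorizer()), ("mnb", MultinomialNB())])
pipe.fit(train_texts, train_labels)
pipeline_preds = pipe.predict(new_texts)

# Both routes should agree, since they run the same steps
print(list(manual_preds) == list(pipeline_preds))
```

The pipeline version leaves no room for accidentally calling fit_transform on the test data or forgetting the vectorizing step altogether.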
___

## Naive Bayes Classifier
---
We will use the MultinomialNB model from the scikit-learn library, which we imported at the top.

Next, we will create a model named “text_mnb” using Pipeline, passing first a TfidfVectorizer() object and then a MultinomialNB() object. The order matters: TfidfVectorizer must first fit and transform our data, and its output is then passed to the model. Then, we will fit the pipeline on the X_train and y_train data sets.

From here on, the Pipeline handles these internal steps and runs them in the right order.

In [20]:
# Each step is a (name, estimator) tuple: pick any name, then pass the object to run at that step
text_mnb = Pipeline([('tfidf', TfidfVectorizer()), ('mnb', MultinomialNB())])

In [21]:
# Now you can simply pass the pipeline object to fit the training dataset
text_mnb.fit(X_train, y_train)

Pipeline(steps=[('tfidf', TfidfVectorizer()), ('mnb', MultinomialNB())])

In [22]:
# The pipeline object will vectorize X_test and then predict on it
y_preds_mnb = text_mnb.predict(X_test)

In [23]:
# Predictions
y_preds_mnb

array(['ham', 'ham', 'ham', ..., 'ham', 'ham', 'ham'], dtype='<U4')

In [24]:
# Review the training accuracy
text_mnb.score(X_train, y_train)

0.9774979908920439

In [25]:
# Review the testing accuracy
text_mnb.score(X_test, y_test)

0.9711799891245242

---
### Evaluation Metrics

In [26]:
print(confusion_matrix(y_test, y_preds_mnb))

[[1593    0]
 [  53  193]]


In [27]:
print(classification_report(y_test, y_preds_mnb))

              precision    recall  f1-score   support

         ham       0.97      1.00      0.98      1593
        spam       1.00      0.78      0.88       246

    accuracy                           0.97      1839
   macro avg       0.98      0.89      0.93      1839
weighted avg       0.97      0.97      0.97      1839



---
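Here we can see that the *“ham”* label is predicted well, with 100% recall, but the *“spam”* predictions are not ideal: recall is only 0.78, meaning 53 of the 246 spam messages in the test set were classified as ham. So despite 97% accuracy, we can’t say that the model is excellent, because it misses a sizeable share of the spam.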
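
The figures in the classification report can be re-derived from the confusion matrix. The short sketch below does this arithmetic for the spam class, assuming the usual scikit-learn layout in which rows are true labels and columns are predicted labels, both sorted as ham then spam. The variable names are illustrative.

```python
# Confusion matrix printed above for the Naive Bayes pipeline.
# Rows are true labels, columns are predicted labels, in the order [ham, spam].
cm = [[1593, 0],
      [53, 193]]

(ham_as_ham, ham_as_spam), (spam_as_ham, spam_as_spam) = cm

# Recall: share of actual spam that was caught
spam_recall = spam_as_spam / (spam_as_spam + spam_as_ham)
# Precision: share of spam predictions that were really spam
spam_precision = spam_as_spam / (spam_as_spam + ham_as_spam)
# Accuracy: share of all test messages classified correctly
accuracy = (ham_as_ham + spam_as_spam) / sum(sum(row) for row in cm)

print(f"spam recall:    {spam_recall:.2f}")     # 0.78
print(f"spam precision: {spam_precision:.2f}")  # 1.00
print(f"accuracy:       {accuracy:.2f}")        # 0.97
```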

Let’s try out the same problem with SVM (Support Vector Machine).
___

## Linear SVC (Support Vector Classifier)

In [28]:
# Each step is a (name, estimator) tuple: pick any name, then pass the object to run at that step
text_svc = Pipeline([('tfidf', TfidfVectorizer()), ('svc', LinearSVC())])

In [29]:
# Now you can simply pass the pipeline object to fit the training dataset
text_svc.fit(X_train, y_train)

Pipeline(steps=[('tfidf', TfidfVectorizer()), ('svc', LinearSVC())])

In [30]:
# The pipeline object will vectorize X_test and then predict on it
y_preds_svc = text_svc.predict(X_test)

In [31]:
y_preds_svc

array(['ham', 'ham', 'ham', ..., 'ham', 'ham', 'ham'], dtype=object)

In [32]:
# Review the training and testing accuracies

print('Training Accuracy: {:.2f}'.format(text_svc.score(X_train, y_train)))
print('Testing Accuracy: {:.2f}'.format(text_svc.score(X_test, y_test)))

Training Accuracy: 1.00
Testing Accuracy: 0.99


---
### Evaluation Metrics

In [33]:
print(confusion_matrix(y_test, y_preds_svc))

[[1589    4]
 [  18  228]]


In [34]:
print(classification_report(y_test, y_preds_svc))

              precision    recall  f1-score   support

         ham       0.99      1.00      0.99      1593
        spam       0.98      0.93      0.95       246

    accuracy                           0.99      1839
   macro avg       0.99      0.96      0.97      1839
weighted avg       0.99      0.99      0.99      1839



---
We can see that *“ham”* is still predicted with a recall of nearly 100% (4 of 1593 ham messages were flagged as spam), and *“spam”* recall increased significantly, from 0.78 with the MultinomialNB model to 0.93.
___

## Predicting on New SMS

In [35]:
text = 'Congratulations, you have won a lottery of $5000! To Claim Text YES to 7667!'

In [36]:
text = clean(text)
text

'congratulation lottery 5000 claim text yes 7667'

In [37]:
# Using our MultinomialNB model to directly predict the single message
text_mnb.predict([text])

array(['spam'], dtype='<U4')

In [38]:
# Using our LinearSVC model to directly predict the single message 
text_svc.predict([text])

array(['spam'], dtype=object)