## SMS Spam Analysis

This corpus was collected from sources on the Internet that are free or free for research use:

- A collection of 425 SMS spam messages was manually extracted from the Grumbletext website, a UK forum in which cell phone users make public claims about SMS spam messages, most of them without quoting the spam message they received. Identifying the text of spam messages in these claims is a hard and time-consuming task that involved carefully scanning hundreds of web pages. The Grumbletext website is: http://www.grumbletext.co.uk/.

- A subset of 3,375 randomly chosen SMS ham messages from the NUS SMS Corpus (NSC), a dataset of about 10,000 legitimate messages collected for research at the Department of Computer Science at the National University of Singapore. The messages largely originate from Singaporeans, mostly students attending the University. They were collected from volunteers who were made aware that their contributions would be made publicly available. The NUS SMS Corpus is available at: http://www.comp.nus.edu.sg/~rpnlpir/downloads/corpora/smsCorpus/.

- A list of 450 SMS ham messages collected from Caroline Tagg's PhD thesis, available at http://etheses.bham.ac.uk/253/1/Tagg09PhD.pdf.

- Finally, we have incorporated the SMS Spam Corpus v.0.1 Big. It has 1,002 SMS ham messages and 322 spam messages and is publicly available at: http://www.esp.uem.es/jmgomez/smsspamcorpus/.

In [51]:
# Import all required libraries
import string
import random
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist 
from nltk.tokenize import regexp_tokenize
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk import ne_chunk
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report
from imblearn.over_sampling import SMOTE

In [2]:
# Download NLTK stopwords
nltk.download("stopwords")

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/suddhasatwa/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [3]:
# setup warnings
import warnings
warnings.filterwarnings("ignore")

In [4]:
# load SMS Spam dataset
df = pd.read_csv('SMSSpamCollection', delimiter='\t', names=['class', 'text'])
df.head()

Unnamed: 0,class,text
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [5]:
# check for class value counts
df['class'].value_counts()

class
ham     4825
spam     747
Name: count, dtype: int64

In [6]:
# most frequent SMS in each class
df.groupby('class').describe()

class,count,unique,top,freq
ham,4825,4516,"Sorry, I'll call later",30
spam,747,653,Please call our customer service representativ...,4


In [7]:
# Utility function to perform
# basic cleaning of SMS text
def process(text):
    """
    Given the input as the SMS Text, perform initial cleaning.
    """
    
    # convert to lower case
    text = text.lower()
    
    # remove punctuations
    text = ''.join([t for t in text if t not in string.punctuation])
    
    # remove stopwords (build the set once instead of reloading the list per token)
    stop_words = set(stopwords.words('english'))
    text = [t for t in text.split() if t not in stop_words]
    
    # perform stemming
    port_stem = PorterStemmer()
    text = [port_stem.stem(t) for t in text]
    
    # return each token
    return text

In [8]:
"""
test run the utility function
on any random SMS Text
"""

# get the no. of rows in the data
df_num_rows = df.shape[0]

# generate a random index in the valid range
# (random.randint includes both endpoints)
rand_index = random.randint(0, df_num_rows - 1)

# take the random text with
# the random index
text = process(df['text'][rand_index])

# print text
print(f"Original SMS Text: {df['text'][rand_index]}")
print(f"Processed SMS Text: {text}")

Original SMS Text: Sorry, I can't text &amp; drive coherently, see you in twenty
Processed SMS Text: ['sorri', 'cant', 'text', 'amp', 'drive', 'coher', 'see', 'twenti']


In [9]:
# Generating Bi-Grams for a random
# SMS Text from the Corpus.

random_sms = df['text'][rand_index]
print(f"Random SMS Text: {random_sms}")

random_sms_words = random_sms.split(" ")
print(f"Broken random SMS into word tokens (n=1): {random_sms_words}")

rand_sms_bigram = list(nltk.bigrams(random_sms_words))
print(f"Bi-Gram SMS Text: {rand_sms_bigram}")

Random SMS Text: Sorry, I can't text &amp; drive coherently, see you in twenty
Broken random SMS into word tokens (n=1): ['Sorry,', 'I', "can't", 'text', '&amp;', 'drive', 'coherently,', 'see', 'you', 'in', 'twenty']
Bi-Gram SMS Text: [('Sorry,', 'I'), ('I', "can't"), ("can't", 'text'), ('text', '&amp;'), ('&amp;', 'drive'), ('drive', 'coherently,'), ('coherently,', 'see'), ('see', 'you'), ('you', 'in'), ('in', 'twenty')]
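
The bigram step generalises to any n. As a minimal sketch that does not use NLTK (the helper name `ngrams_from_tokens` is made up for illustration), sliding a window of width n over the token list gives the same kind of tuples as `nltk.bigrams` when n is 2:

```python
def ngrams_from_tokens(tokens, n=2):
    """Return the list of n-grams (as tuples) found in a token list."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # zip over n staggered views of the list; zip stops at the shortest one
    return list(zip(*(tokens[i:] for i in range(n))))


tokens = "Sorry, I can't text &amp; drive".split(" ")
bigrams = ngrams_from_tokens(tokens, n=2)
trigrams = ngrams_from_tokens(tokens, n=3)
print(bigrams)
print(trigrams)
```

A list of k tokens yields k - n + 1 n-grams, so short messages produce very few higher-order n-grams.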


In [10]:
""" Implement Bag of Words using Count Vectorizer (Approach # 1)"""

# create an object of Count Vectorizer
vectorizer = CountVectorizer(stop_words='english', ngram_range=(1,1))

# transform our SMS text data using this object
sms_spam_cv = vectorizer.fit_transform(df['text'])

In [11]:
# see the final representation of the data.
count_vector_df = pd.DataFrame(
    sms_spam_cv.todense(),
    columns=vectorizer.get_feature_names_out(),
)

# see the dataframe
count_vector_df

Unnamed: 0,00,000,000pes,008704050406,0089,0121,01223585236,01223585334,0125698789,02,...,zhong,zindgi,zoe,zogtorius,zoom,zouk,zyada,èn,ú1,〨ud
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5567,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5568,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5569,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5570,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
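
To make the table above easier to read, here is a rough plain-Python sketch of the bag-of-words idea on three toy messages. It assumes lower-casing and whitespace tokenisation only, so it is not a reimplementation of `CountVectorizer` (which uses its own token pattern and, in the cell above, an English stop-word list); the helper `bag_of_words` and the toy messages are invented for illustration.

```python
from collections import Counter


def bag_of_words(docs):
    """Build a sorted vocabulary and one count row per document."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        # one column per vocabulary term, zero when the term is absent
        rows.append([counts[term] for term in vocab])
    return vocab, rows


docs = ["free entry free prize", "ok see you later", "free call later"]
vocab, rows = bag_of_words(docs)
print(vocab)
for row in rows:
    print(row)
```

Each row holds mostly zeros even in this tiny example, which is why the full matrix for the corpus looks almost entirely empty.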


In [12]:
""" Approach 2: Implement TF-IDF Vectorizer on SMS Spam Data """

# Create a TF-IDF vectorizer
vectorizer = TfidfVectorizer()

# Fit and transform the documents
tfidf_matrix = vectorizer.fit_transform(df['text'])

# Represent the TF-IDF data as a dataframe
sms_tfidf_df = pd.DataFrame(
    tfidf_matrix.todense(),
    columns=vectorizer.get_feature_names_out()
)

# print the dataframe
sms_tfidf_df

Unnamed: 0,00,000,000pes,008704050406,0089,0121,01223585236,01223585334,0125698789,02,...,zhong,zindgi,zoe,zogtorius,zoom,zouk,zyada,èn,ú1,〨ud
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5567,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5568,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5569,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5570,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
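
For reference, with its default settings `TfidfVectorizer` multiplies each raw term count by a smoothed inverse document frequency, `ln((1 + n) / (1 + df)) + 1`, and then scales every row to unit Euclidean length. The sketch below reproduces only that weighting in plain Python; the whitespace tokenisation, the helper name `tfidf_rows`, and the toy messages are assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """TF-IDF with smoothed idf and L2 row normalisation (whitespace tokens)."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(tokenized)
    # document frequency: number of documents containing each term
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenized:
        tf = Counter(toks)
        raw = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows


docs = ["free entry free prize", "ok see you later", "free call later"]
vocab, rows = tfidf_rows(docs)
weights = dict(zip(vocab, rows[2]))
print({term: round(w, 3) for term, w in weights.items() if w > 0})
```

In the third toy message, `call` appears in only one document and therefore receives a larger weight than `free` or `later`, which each appear in two.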


In [44]:
""" Join the SMS dataset with the TF-IDF dataset """

# getting list of all columns for final dataset
all_columns = df.columns.tolist() + sms_tfidf_df.columns.tolist()

# prepare final dataset with all required columns
joined_df = pd.concat([df, sms_tfidf_df], axis=1, ignore_index=True)
joined_df.columns = all_columns
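# note: drop() removes every column carrying this label, so a TF-IDF
# feature that happens to be named 'text' would be dropped as well
joined_df.drop('text', axis=1, inplace=True)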

# Sample 5 records of the prepared dataset
joined_df.head()

Unnamed: 0,class,00,000,000pes,008704050406,0089,0121,01223585236,01223585334,0125698789,...,zhong,zindgi,zoe,zogtorius,zoom,zouk,zyada,èn,ú1,〨ud
0,ham,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,ham,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,spam,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,ham,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,ham,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [45]:
""" Prepare Final Training data, before split """

# label encoding the target column 'class'
le = LabelEncoder()
class_le = le.fit_transform(df['class'])

# putting the label encoded class into the dataframe
final_train_df = joined_df.copy()
final_train_df.drop('class', axis=1, inplace=True)
final_train_df['class'] = class_le
final_train_df.head()

Unnamed: 0,00,000,000pes,008704050406,0089,0121,01223585236,01223585334,0125698789,02,...,zindgi,zoe,zogtorius,zoom,zouk,zyada,èn,ú1,〨ud,class
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
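
`LabelEncoder` assigns integer codes following the sorted order of the class labels, which is why `ham` becomes 0 and `spam` becomes 1 in the table above. A minimal plain-Python equivalent of that mapping (the helper `encode_labels` is illustrative only, not part of scikit-learn) could look like this:

```python
def encode_labels(labels):
    """Map each distinct label to its rank in sorted order."""
    classes = sorted(set(labels))
    index = {label: i for i, label in enumerate(classes)}
    return classes, [index[label] for label in labels]


# labels of the first five rows shown by df.head()
classes, encoded = encode_labels(["ham", "ham", "spam", "ham", "ham"])
print(classes)
print(encoded)
```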


In [46]:
""" Initial Naive Bayes Model """

# break data into train/test
features = final_train_df.drop('class', axis=1)
target = final_train_df['class']
x_train, x_test, y_train, y_test = train_test_split(features, target, test_size=0.25, random_state=101)

# Fit initial Multinomial Naive Bayes model
lr = MultinomialNB().fit(x_train, y_train)

# Predict from Test data
y_pred = lr.predict(x_test)

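# generate classification report
# NOTE: scikit-learn's signature is classification_report(y_true, y_pred);
# the arguments here are reversed, so the precision and recall columns
# (and the support counts) in the output below are swapped
print(classification_report(y_pred, y_test))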

              precision    recall  f1-score   support

           0       1.00      0.95      0.98      1287
           1       0.64      1.00      0.78       106

    accuracy                           0.96      1393
   macro avg       0.82      0.98      0.88      1393
weighted avg       0.97      0.96      0.96      1393
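
A note on reading this report: scikit-learn documents the call as `classification_report(y_true, y_pred)`, whereas the cell above passes the predictions first. The toy sketch below (hand-rolled helper `precision_recall`, invented labels) shows what that swap does: precision and recall for a class trade places.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall of one class from paired label lists."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    predicted = sum(1 for p in y_pred if p == positive)
    actual = sum(1 for t in y_true if t == positive)
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    return precision, recall


# toy labels: 4 true spam messages, the model flags only 2 of them
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0]

conventional = precision_recall(y_true, y_pred)
swapped = precision_recall(y_pred, y_true)
print(conventional)
print(swapped)
```

Read with the swap undone, the spam row above corresponds to a precision of about 1.00 and a recall of about 0.64, and the support column counts predicted rather than true labels.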



In [49]:
# Install SMOTE library
!pip install imbalanced-learn

Collecting imbalanced-learn
  Downloading imbalanced_learn-0.12.2-py3-none-any.whl (257 kB)
Installing collected packages: imbalanced-learn
Successfully installed imbalanced-learn-0.12.2


In [56]:
""" Perform SMOTE on Training data """

# import the library
from imblearn.over_sampling import SMOTE

# perform SMOTE on training data
# by creating an object
sms_smote = SMOTE()
x_train_smote, y_train_smote = sms_smote.fit_resample(x_train, y_train)
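
As a rough illustration of what the resampling step does, SMOTE builds each synthetic minority sample by picking a minority point, choosing one of its nearest minority neighbours, and interpolating between the two. The sketch below is a simplified plain-Python version of that idea, not imbalanced-learn's implementation; the helper `smote_sample`, the brute-force neighbour search, and the toy points are assumptions.

```python
import random


def smote_sample(minority, k=2, rng=None):
    """Create one synthetic point between a minority sample and a near neighbour."""
    if rng is None:
        rng = random.Random()
    base = rng.choice(minority)
    # brute-force search: the k minority points closest to `base`
    others = [p for p in minority if p is not base]
    others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
    neighbour = rng.choice(others[:k])
    gap = rng.random()  # interpolation factor in [0, 1)
    return [a + gap * (b - a) for a, b in zip(base, neighbour)]


minority = [[0.0, 0.2], [0.1, 0.3], [0.4, 0.1], [0.3, 0.5]]
synthetic = smote_sample(minority, k=2, rng=random.Random(0))
print(synthetic)
```

Because the procedure is random, `SMOTE()` without a seed produces a different resampled training set on every run; imbalanced-learn's `SMOTE` accepts a `random_state` argument for reproducibility.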

In [57]:
""" Fit another Naive Bayes model on SMOTE data """

# Fit a Multinomial Naive Bayes model on the SMOTE-resampled data
lr_new = MultinomialNB().fit(x_train_smote, y_train_smote)

# Predict from Test data
y_pred_smote = lr_new.predict(x_test)

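# generate classification report
# NOTE: scikit-learn's signature is classification_report(y_true, y_pred);
# the arguments here are reversed, so the precision and recall columns
# (and the support counts) in the output below are swapped
print(classification_report(y_pred_smote, y_test))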

              precision    recall  f1-score   support

           0       0.98      0.99      0.99      1223
           1       0.91      0.89      0.90       170

    accuracy                           0.98      1393
   macro avg       0.95      0.94      0.94      1393
weighted avg       0.98      0.98      0.98      1393

