Context

Spam is unsolicited, unwanted electronic messaging whose content may be malicious. 
Email spam is sent and received over the Internet, while SMS spam is typically transmitted over a mobile network. 
We’ll refer to users who send spam as ‘spammers’. SMS messages are usually very cheap (if not free) to send, 
which makes the channel attractive to abuse. This is aggravated by the fact that users usually regard SMS 
as a safer, more trustworthy form of communication than other channels, e.g., email.

The dangers of spam messages for users are many: unwanted advertising, exposure of private information, 
falling victim to fraud or a financial scam, being lured to malware and phishing websites, involuntary exposure 
to inappropriate content, etc. For the network operator, spam messages increase operating costs.

In [None]:
#Importing all the libraries to be used
import warnings
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np 
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline    
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from matplotlib.colors import ListedColormap
# plot_confusion_matrix was removed in scikit-learn 1.2 and is not used below, so it is not imported
from sklearn.metrics import precision_score, recall_score, classification_report, accuracy_score, f1_score
from sklearn import metrics

In [None]:
df = pd.read_csv(r"C:\Users\vivih\Downloads\spam.csv", encoding='latin-1')
df.head()

In [None]:
df.info()

This dataset has three unnamed columns ("Unnamed: 2" to "Unnamed: 4") that we don't need, so we drop them.
The label is also stored as a string (spam or ham), so we map it to numerical form before model building.
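
As a minimal sketch of that label mapping, the snippet below encodes string labels as integers in plain Python. It numbers the classes in sorted order, which is also how scikit-learn's LabelEncoder assigns codes, so ham becomes 0 and spam becomes 1. The helper name encode_labels and the sample labels are made up for illustration and are not part of the notebook.

```python
def encode_labels(labels):
    """Map each distinct label to an integer, numbering classes in sorted order."""
    classes = sorted(set(labels))
    mapping = {label: index for index, label in enumerate(classes)}
    return [mapping[label] for label in labels], mapping


# Hypothetical sample labels, not taken from the dataset
sample_labels = ["ham", "spam", "ham", "ham", "spam"]
encoded, mapping = encode_labels(sample_labels)
print(mapping)
print(encoded)
```

Because spam ends up as class 1, metrics such as precision and recall computed later treat spam as the positive class by default.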

In [None]:
# Dropping the redundant-looking columns (for this project)
to_drop = ["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"]
df = df.drop(columns=to_drop)
# Renaming the columns because I feel fancy today
df.rename(columns={"v1": "Target", "v2": "Text"}, inplace=True)
df.head()

In [None]:
#missing values
df.isnull().sum()

In [None]:
#check for duplicates
df.duplicated().sum()

In [None]:
df=df.drop_duplicates(keep='first')

In [None]:
df.shape

The dataset consists of 5,574 messages in English, each labeled as ham (legitimate) or spam. After the cleanup above, 
the dataframe has two columns: "Target", which gives the class of the message (ham or spam), and "Text", which holds the message string.

Data Exploration

In [None]:
cols= ["#32cd32", "#1e90ff"] 
#first of all let us evaluate the target and find out if our data is imbalanced or not
plt.figure(figsize=(12,8))
fg = sns.countplot(x= df["Target"], palette= cols)
fg.set_title("Count Plot of Classes", color="#000000")
fg.set_xlabel("Classes", color="#000000")
fg.set_ylabel("Number of Data points", color="#000000")



For the purpose of data exploration, I am creating three new features:

Sum_characters: Number of characters in the text message
Sum_words: Number of words in the text message
Sum_sentence: Number of sentences in the text message

In [None]:
# word_tokenize and sent_tokenize need NLTK's Punkt tokenizer data to be downloaded first
df["Sum_characters"] = df["Text"].apply(len)
df["Sum_words"] = df["Text"].apply(lambda text: len(nltk.word_tokenize(text)))
df["Sum_sentence"] = df["Text"].apply(lambda text: len(nltk.sent_tokenize(text)))

df.describe().T

In [None]:
# pairplot creates its own figure, so no separate plt.figure call is needed
sns.pairplot(data=df, hue="Target", palette=cols)
plt.show()

Note: From the pair plot, we can see a few outliers, all in the ham class. Since the three features 
essentially measure the same thing, i.e. the length of the SMS, we can put a cap on just one of them to remove these outliers.

In [None]:
#Dropping the outliers. 
df = df[(df["Sum_characters"]<350)]
df.shape

In [None]:
# pairplot creates its own figure, so no separate plt.figure call is needed
sns.pairplot(data=df, hue="Target", palette=cols)
plt.show()

Data cleaning is a crucial step in any machine learning workflow, and even more so in NLP. 
Without it, the dataset is often just a jumble of words that the computer cannot make sense of.

In [None]:
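# Defining a function to clean up the text
def Clean(Text):
    # Replace every non-alphabetic character with a space
    sms = re.sub('[^a-zA-Z]', ' ', Text)
    # Lowercase, then split and re-join to collapse repeated whitespace
    sms = sms.lower()
    sms = sms.split()
    sms = ' '.join(sms)
    return sms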

df["Clean_Text"] = df["Text"].apply(Clean)

print("\033[1m\u001b[45;1m The First 5 Texts after cleaning:\033[0m",*df["Clean_Text"][:5], sep = "\n")

Tokenization is typically one of the first steps in an NLP pipeline, and it affects every later stage. 
A tokenizer breaks unstructured natural language text into chunks of information that can be 
treated as discrete elements (tokens). The token occurrences in a document can be used directly 
as a vector representing that document. 

This immediately turns an unstructured string (text document) into 
a numerical data structure suitable for machine learning. Tokens can also be used directly by a computer 
to trigger useful actions and responses, or they can serve as features in a machine learning pipeline 
that drive more complex decisions or behavior.
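
To make the idea of "token occurrences as a vector" concrete, here is a small bag-of-words sketch in plain Python. It is only an illustration: the regex tokenizer is a simplification (the notebook itself uses nltk.word_tokenize), and the two messages and the helper names tokenize and bag_of_words are invented for this example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def bag_of_words(tokens, vocabulary):
    """Count how often each vocabulary word occurs in the token list."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


# Hypothetical messages used only for illustration
messages = ["Free entry! Win a free prize", "Are you free tonight"]
tokenized = [tokenize(message) for message in messages]

# The vocabulary is the sorted set of all tokens seen in the messages
vocabulary = sorted({token for tokens in tokenized for token in tokens})
vectors = [bag_of_words(tokens, vocabulary) for tokens in tokenized]

print(vocabulary)
for vector in vectors:
    print(vector)
```

Each message becomes a list of counts with one position per vocabulary word, which is the kind of numerical structure a classifier can consume.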

In [None]:
df["Tokenize_Text"]=df.apply(lambda row: nltk.word_tokenize(row["Clean_Text"]), 
                                 axis=1)

print("\033[1m\u001b[45;1m The First 5 Texts after Tokenizing:\033[0m",
      *df["Tokenize_Text"][:5], sep = "\n")

The process of converting data into something a computer can understand 
is referred to as pre-processing. One of the major forms of pre-processing 
is filtering out uninformative data. In natural language processing, such uninformative words are referred to as stop words. 

In [None]:
# Build the stop word set once instead of on every call
# (requires the NLTK "stopwords" corpus)
stop_words = set(stopwords.words("english"))

def remove_stopwords(text):
    filtered_text = [word for word in text if word not in stop_words]
    return filtered_text

df["Nostopword_Text"] = df["Tokenize_Text"].apply(remove_stopwords)

print("\033[1m\u001b[45;1m The First 5 Texts after removing the stopwords:\033[0m",*df["Nostopword_Text"][:5], sep = "\n")

Stemming is a natural language processing technique that reduces inflected words 
to their root forms, which helps normalize text during preprocessing.

According to Wikipedia, inflection is the process through which a word is 
modified to communicate many grammatical categories, including tense, case, 
voice, aspect, person, number, gender, and mood. Thus, although a word may exist in several inflected forms, 
having multiple inflected forms inside the same text adds redundancy to the NLP process.

Lemmatization reduces a word to its canonical or dictionary form, 
called a ‘lemma’. It groups the inflected forms 
of a word so that they can be recognised as a single element. 
The process is similar to stemming, but the resulting root words are real dictionary words.
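
The difference between the two approaches can be sketched with a toy example. The code below is deliberately simplistic and rests on stated assumptions: toy_stem is a naive suffix stripper (not the Porter algorithm), and toy_lemmatize is a lookup in a tiny hand-written dictionary (not WordNet). Both helpers and the word list are hypothetical.

```python
# Suffixes tried in order by the toy stemmer
SUFFIXES = ("ing", "ed", "es", "s")

# Tiny hand-written lemma dictionary, standing in for a real lexical database
TOY_LEMMAS = {"studies": "study", "running": "run", "was": "be"}


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Look the word up in the dictionary; unknown words are returned unchanged."""
    return TOY_LEMMAS.get(word, word)


for word in ["studies", "running", "was"]:
    print(word, "-> stem:", toy_stem(word), "| lemma:", toy_lemmatize(word))
```

The stemmer only chops characters, so it can return fragments such as "studi", while the dictionary lookup returns "study". This is why lemmatization needs a vocabulary resource and stemming does not.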

In [None]:
lemmatizer = WordNetLemmatizer()
# lemmatize a list of tokens (requires the NLTK "wordnet" corpus)
def lemmatize_word(text):
    # provide context i.e. part-of-speech; pos='v' treats every token as a verb
    lemmas = [lemmatizer.lemmatize(word, pos='v') for word in text]
    return lemmas

df["Lemmatized_Text"] = df["Nostopword_Text"].apply(lemmatize_word)
print("\033[1m\u001b[45;1m The First 5 Texts after lemmatization:\033[0m",*df["Lemmatized_Text"][:5], sep = "\n")

TF-IDF (term frequency-inverse document frequency) is a statistical measure that evaluates 
how relevant a word is to a document in a collection of documents.

This is done by multiplying two metrics: how many times a word appears in a document (term frequency), 
and the inverse document frequency of the word across the set of documents, which down-weights words that occur in many documents.

It has many uses, most importantly in automated text analysis, and is very useful for scoring words 
in machine learning algorithms for Natural Language Processing (NLP).
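
The multiplication described above can be sketched in a few lines of plain Python. This uses the textbook form tf * log(N / df), with the term frequency taken relative to document length. Note that scikit-learn's TfidfVectorizer defaults to a smoothed IDF and L2-normalizes each row, so its numbers will differ from this sketch. The function name tf_idf and the three tiny documents are invented for illustration, and the sketch assumes no document is empty.

```python
import math
from collections import Counter


def tf_idf(corpus_tokens):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    n_docs = len(corpus_tokens)

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in corpus_tokens:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in corpus_tokens:
        term_counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in term_counts.items()
        })
    return weights


# Hypothetical tokenized messages used only for illustration
docs = [["free", "prize", "call", "now"], ["call", "me", "later"], ["free", "call"]]
scores = tf_idf(docs)
for doc_scores in scores:
    print(doc_scores)
```

In this toy corpus "call" appears in every document, so its weight is zero, while "prize" appears in only one document and receives the highest weight in the first message.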

In [None]:
corpus = []
for tokens in df["Lemmatized_Text"]:
    # Join each token list back into a single space-separated string
    corpus.append(' '.join(tokens))

print("\033[1m\u001b[45;1m The First 5 lines in corpus :\033[0m",*corpus[:5], sep = "\n")

In [None]:
#Changing text data into numbers. 
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(corpus).toarray()
#Let's have a look at the dtype of our feature matrix 
X.dtype

In [None]:
label_encoder = LabelEncoder()
df["Target"] = label_encoder.fit_transform(df["Target"])

Model Building

In [None]:
#Setting values for labels and feature as y and X(we already did X in vectorizing...)
y = df["Target"] 
# Splitting the testing and training sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
#Testing on the following classifiers
classifiers = [MultinomialNB(), 
               RandomForestClassifier(),
               KNeighborsClassifier(), 
               SVC()]
for cls in classifiers:
    cls.fit(X_train, y_train)

In [None]:
# Dictionary of pipelines and model types for ease of reference
pipe_dict = {0: "NaiveBayes", 1: "RandomForest", 2: "KNeighbours",3: "SVC"}

In [None]:
# Cross-validation
for i, model in enumerate(classifiers):
    cv_score = cross_val_score(model, X_train,y_train,scoring="accuracy", cv=10)
    print("%s: %f " % (pipe_dict[i], cv_score.mean()))

NaiveBayes: 0.962856 


Evaluating Models

In [None]:
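precision = []
recall = []
# Note: this list shadows the imported f1_score function; metrics.f1_score is used below
f1_score = []
trainset_accuracy = []
testset_accuracy = []


for clf in classifiers:
    pred_train = clf.predict(X_train)
    pred_test = clf.predict(X_test)
    prec = metrics.precision_score(y_test, pred_test)
    recal = metrics.recall_score(y_test, pred_test)
    f1_s = metrics.f1_score(y_test, pred_test)
    # Score the current classifier, not the leftover `model` from the cross-validation loop
    train_accuracy = metrics.accuracy_score(y_train, pred_train)
    test_accuracy = metrics.accuracy_score(y_test, pred_test)

    # Store the scores so the results table below is populated
    precision.append(prec)
    recall.append(recal)
    f1_score.append(f1_s)
    trainset_accuracy.append(train_accuracy)
    testset_accuracy.append(test_accuracy)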
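
For reference, the precision, recall and F1 values computed in the loop above can be reproduced by hand from the counts of true positives, false positives and false negatives. The sketch below does this in plain Python, treating spam (label 1) as the positive class. The helper name precision_recall_f1 and the toy label lists are hypothetical and are not results from this notebook.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Toy labels (1 = spam, 0 = ham), invented for illustration
y_true_toy = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_toy = [1, 0, 0, 1, 0, 0, 1, 0]
toy_precision, toy_recall, toy_f1 = precision_recall_f1(y_true_toy, y_pred_toy)
print(toy_precision, toy_recall, round(toy_f1, 4))
```

In this toy case the filter never flags a ham message (precision 1.0) but misses one of the four spam messages (recall 0.75). For a spam filter that trade-off is often preferable to the reverse, since a legitimate message sent to the spam folder is usually the costlier mistake.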

In [None]:
# initialise data of lists.
data = {'Precision':precision,
'Recall':recall,
'F1score':f1_score,
'Accuracy on Testset':testset_accuracy,
'Accuracy on Trainset':trainset_accuracy}
# Creates pandas DataFrame.
Results = pd.DataFrame(data, index =["NaiveBayes", "RandomForest", "KNeighbours","SVC"])

In [None]:
cmap2 = ListedColormap(["#32cd32", "#1e90ff"])
Results.style.background_gradient(cmap=cmap2)