## SMS Spam Detection

In our day-to-day lives, we receive a large number of spam (junk) messages, either as text messages (SMS) or as e-mails. Filtering these messages is important because they are unsolicited and often untrustworthy.

In this case study, we apply two machine learning algorithms, a random forest and a support vector machine, to classify messages as spam or not spam (ham).

In [0]:
import nltk
import pandas as pd
import csv

In [0]:
messages = pd.read_csv('SMSSpamCollection.txt', sep='\t', names=["label", "message"])

In [3]:
messages.head()

Unnamed: 0,label,message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [4]:
messages.shape

(5572, 2)

In [5]:
messages.groupby('label').describe()

label,message count,message unique,message top,message freq
ham,4825,4516,"Sorry, I'll call later",30
spam,747,653,Please call our customer service representativ...,4


**Target** is the class/category to which each data point is assigned.

* In this case, the aim is to identify whether a message is spam or not.

* The `label` column takes the values `spam` or `ham`. Because there are only two possible outcomes, this is a binary classification problem.
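
Because the two classes are far from balanced, it can help to know how well a trivial model would do before training anything. The sketch below only reuses the class counts from the `groupby` output above; the variable names are illustrative and not part of the original notebook.

```python
# Class counts taken from the groupby output above
counts = {"ham": 4825, "spam": 747}

total = sum(counts.values())
spam_share = counts["spam"] / total

# Accuracy of a trivial model that always predicts the majority class
majority_label = max(counts, key=counts.get)
baseline_accuracy = counts[majority_label] / total

print(f"total messages: {total}")
print(f"spam share: {spam_share:.3f}")
print(f"majority-class baseline ({majority_label}): {baseline_accuracy:.3f}")
```

Any accuracy reported for a classifier on this data is worth reading against that baseline of roughly 0.866.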

In [0]:
#Identifying the outcome/target variable.

message_target=messages['label'] 

## Tokenization

**Tokenization** is a method to split a sentence/string into substrings. These substrings are called **tokens**.

In Natural Language Processing (NLP), tokenization is the initial step in preprocessing. Splitting a sentence into tokens helps to remove unwanted information in the raw text such as white spaces, line breaks and so on.
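
To make the idea concrete, here is a minimal regex-based tokenizer. It is a sketch under stated assumptions: `simple_tokenize` is a hypothetical helper, and it is not the algorithm behind NLTK's `word_tokenize`. For instance, it splits contractions such as "don't" differently.

```python
import re

def simple_tokenize(text):
    """Lowercase the text, then return runs of word characters and runs of punctuation."""
    # \w+ matches words and numbers; [^\w\s]+ matches consecutive punctuation such as "..."
    # Whitespace and line breaks match neither pattern, so they are dropped
    return re.findall(r"\w+|[^\w\s]+", text.lower())

tokens = simple_tokenize("Ok lar...  Joking wif u oni...\n")
print(tokens)
# ['ok', 'lar', '...', 'joking', 'wif', 'u', 'oni', '...']
```

Note how the extra spaces and the trailing line break disappear, while the ellipses survive as tokens of their own.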


In [7]:
nltk.download('all')
from nltk.tokenize import word_tokenize

[nltk_data] Downloading collection 'all'

In [0]:
def split_tokens(message):

  # Lowercase first so that "Free" and "free" map to the same token
  message=message.lower()

  # Split the lowercased message into word and punctuation tokens
  word_tokens =word_tokenize(message)

  return word_tokens

In [0]:
messages['tokenized_message'] = messages.apply(lambda row: split_tokens(row['message']),axis=1)

In [10]:
messages['tokenized_message'][1]

['ok', 'lar', '...', 'joking', 'wif', 'u', 'oni', '...']

## Lemmatization

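* Lemmatization is a method to convert a word into its base/root form (its lemma).

* The lemmatizer strips inflectional affixes from words present in its dictionary (here, WordNet) and leaves unknown words unchanged. Called without a part-of-speech tag, `WordNetLemmatizer` treats every word as a noun, which is why `as` becomes `a` in the output of this section.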
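
The dictionary-lookup idea can be illustrated with a toy example. Everything here is an assumption made for illustration: `LEMMA_DICTIONARY` is a hypothetical three-entry table and `toy_lemmatize` is not how WordNet is implemented, but it shows why out-of-dictionary tokens such as "wkly" pass through untouched.

```python
# Hypothetical miniature dictionary; a real lemmatizer covers far more words
LEMMA_DICTIONARY = {"goes": "go", "months": "month", "customers": "customer"}

def toy_lemmatize(tokens):
    """Replace each token by its dictionary base form; unknown tokens pass through unchanged."""
    return [LEMMA_DICTIONARY.get(token, token) for token in tokens]

lemmas = toy_lemmatize(["he", "goes", "11", "months", "wkly"])
print(lemmas)
# ['he', 'go', '11', 'month', 'wkly']
```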

In [0]:
from nltk.stem.wordnet import WordNetLemmatizer

# Create the lemmatizer once and reuse it for every message
lemmatizer = WordNetLemmatizer()

def split_into_lemmas(message):

    # Lemmatize each token of an already tokenized message
    return [lemmatizer.lemmatize(word) for word in message]

messages['lemmatized_message'] = messages.apply(lambda row: split_into_lemmas(row['tokenized_message']),axis=1)

In [12]:
messages.head(10)

Unnamed: 0,label,message,tokenized_message,lemmatized_message
0,ham,"Go until jurong point, crazy.. Available only ...","[go, until, jurong, point, ,, crazy.., availab...","[go, until, jurong, point, ,, crazy.., availab..."
1,ham,Ok lar... Joking wif u oni...,"[ok, lar, ..., joking, wif, u, oni, ...]","[ok, lar, ..., joking, wif, u, oni, ...]"
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,"[free, entry, in, 2, a, wkly, comp, to, win, f...","[free, entry, in, 2, a, wkly, comp, to, win, f..."
3,ham,U dun say so early hor... U c already then say...,"[u, dun, say, so, early, hor, ..., u, c, alrea...","[u, dun, say, so, early, hor, ..., u, c, alrea..."
4,ham,"Nah I don't think he goes to usf, he lives aro...","[nah, i, do, n't, think, he, goes, to, usf, ,,...","[nah, i, do, n't, think, he, go, to, usf, ,, h..."
5,spam,FreeMsg Hey there darling it's been 3 week's n...,"[freemsg, hey, there, darling, it, 's, been, 3...","[freemsg, hey, there, darling, it, 's, been, 3..."
6,ham,Even my brother is not like to speak with me. ...,"[even, my, brother, is, not, like, to, speak, ...","[even, my, brother, is, not, like, to, speak, ..."
7,ham,As per your request 'Melle Melle (Oru Minnamin...,"[as, per, your, request, 'melle, melle, (, oru...","[a, per, your, request, 'melle, melle, (, oru,..."
8,spam,WINNER!! As a valued network customer you have...,"[winner, !, !, as, a, valued, network, custome...","[winner, !, !, a, a, valued, network, customer..."
9,spam,Had your mobile 11 months or more? U R entitle...,"[had, your, mobile, 11, months, or, more, ?, u...","[had, your, mobile, 11, month, or, more, ?, u,..."


## Stop Word Removal

Stop words are common words that carry little information for classification (e.g. “the”, “a”, “an”, “in”). Removing them reduces noise and shrinks the vocabulary.
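
As a small illustration, the filter below removes stop words from a token list. The `STOP_WORDS` set is a hypothetical, hand-picked list for this example only; NLTK's English list is much longer.

```python
# Hypothetical, hand-picked stop word list for illustration only
STOP_WORDS = {"the", "a", "an", "in", "to", "is", "i"}

def remove_stop_words(tokens):
    """Keep only the tokens that are not in the stop word set."""
    return [token for token in tokens if token not in STOP_WORDS]

sample_tokens = ["free", "entry", "in", "2", "a", "wkly", "comp", "to", "win"]
filtered = remove_stop_words(sample_tokens)
print(filtered)
# ['free', 'entry', '2', 'wkly', 'comp', 'win']
```

Using a `set` keeps each membership test cheap, which matters once the filter runs over thousands of messages.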

In [0]:
from nltk.corpus import stopwords

# Build the stop word set once instead of on every call
stop_words = set(stopwords.words('english'))

def stopword_removal(message):

    # Drop stop words and join the remaining tokens back into a single string
    return ' '.join([word for word in message if word not in stop_words])

messages['preprocessed_message'] = messages.apply(lambda row: stopword_removal(row['lemmatized_message']),axis=1)

Training_data=pd.Series(list(messages['preprocessed_message']))

Training_label=pd.Series(list(messages['label']))

## Term Document Matrix

* The Term Document Matrix (TDM) is a matrix that contains the frequency of occurrence of each term in a collection of documents.

* Here the rows represent documents (messages) and the columns represent terms. This is the layout that `CountVectorizer` produces, also called a document-term matrix.
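
The matrix layout described above can be sketched in a few lines of plain Python. The three messages are made up, and the sketch counts single words only, whereas the notebook's vectorizer also counts two-word sequences (`ngram_range=(1, 2)`).

```python
from collections import Counter

documents = ["free entry win", "ok lar", "win free prize free"]  # made-up messages

# Vocabulary: every distinct term, sorted so that the column order is stable
vocabulary = sorted({term for doc in documents for term in doc.split()})

# One row per document, one column per vocabulary term, each cell a raw count
matrix = []
for doc in documents:
    term_counts = Counter(doc.split())
    matrix.append([term_counts[term] for term in vocabulary])

print(vocabulary)
for row in matrix:
    print(row)
# ['entry', 'free', 'lar', 'ok', 'prize', 'win']
# [1, 1, 0, 0, 0, 1]
# [0, 0, 1, 1, 0, 0]
# [0, 2, 0, 0, 1, 1]
```

Most cells are zero, which is why libraries usually store such matrices in a sparse format.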

In [0]:
from sklearn.feature_extraction.text import CountVectorizer

tf_vectorizer = CountVectorizer(ngram_range=(1, 2),min_df = (1/len(Training_label)), max_df = 0.7)

#tf_vectorizer = CountVectorizer(ngram_range=(1, 2))

Total_Dictionary_TDM = tf_vectorizer.fit(Training_data)

message_data_TDM = Total_Dictionary_TDM.transform(Training_data)

In [15]:
message_data_TDM.shape

(5572, 40377)

In [0]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, 2),min_df = (1/len(Training_label)), max_df = 0.7)

Total_Dictionary_TFIDF = tfidf_vectorizer.fit(Training_data)

message_data_TFIDF = Total_Dictionary_TFIDF.transform(Training_data)

In [17]:
message_data_TFIDF.shape

(5572, 40377)
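
The TF-IDF matrix has the same shape as the count matrix because only the cell values change: each raw count is reweighted so that terms appearing in many messages count for less. The following is a rough sketch of that weighting, not the library's code. It assumes the smoothed formula `ln((1 + n) / (1 + df)) + 1` followed by Euclidean normalization, which is what scikit-learn documents as its default behaviour; the documents and function names are made up.

```python
import math
from collections import Counter

# Made-up tokenized messages
documents = [["free", "win"], ["ok", "lar"], ["win", "free", "prize", "free"]]
n_docs = len(documents)

# Document frequency: in how many documents each term appears
doc_freq = Counter(term for doc in documents for term in set(doc))

def smoothed_idf(term):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf(doc):
    """Raw term count times idf, scaled so the vector has unit Euclidean length."""
    weights = {term: count * smoothed_idf(term) for term, count in Counter(doc).items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

weights = tfidf(documents[2])
print({term: round(w, 3) for term, w in weights.items()})
```

In the third message, "prize" occurs in fewer documents than "win", so it receives the larger weight even though both appear once.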

In [0]:
from sklearn.model_selection import train_test_split  #Splitting the data for training and testing

train_data,test_data, train_label, test_label = train_test_split(message_data_TDM, Training_label, test_size=.1)
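
Conceptually, a hold-out split shuffles the rows and sets a fraction aside for testing. The sketch below shows the idea with a fixed seed so that the split is repeatable; `holdout_split` is a hypothetical helper, not scikit-learn's implementation. The call above does not fix a seed, so its split differs from run to run; `train_test_split` accepts a `random_state` argument for that purpose.

```python
import math
import random

def holdout_split(items, labels, test_size=0.1, seed=10):
    """Shuffle row indices with a fixed seed and hold out a fraction for testing."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(len(items) * test_size)  # round the test share up
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    train_items = [items[i] for i in train_idx]
    test_items = [items[i] for i in test_idx]
    train_labels = [labels[i] for i in train_idx]
    test_labels = [labels[i] for i in test_idx]
    return train_items, test_items, train_labels, test_labels

# Made-up data: 20 messages, every fifth one labelled spam
messages_demo = [f"msg {i}" for i in range(20)]
labels_demo = ["spam" if i % 5 == 0 else "ham" for i in range(20)]

train_x, test_x, train_y, test_y = holdout_split(messages_demo, labels_demo)
print(len(train_x), len(test_x))
```

With 5572 messages and `test_size=.1`, rounding up in this way gives 558 test rows.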

In [19]:
from sklearn.ensemble import RandomForestClassifier

rfclassifier = RandomForestClassifier(max_depth=5, n_estimators=15, max_features=60,random_state=10)

rfclassifier = rfclassifier.fit(train_data, train_label)

rfscore=rfclassifier.score(test_data, test_label)

print('Random Forest classification after model tuning',rfscore)

Random Forest classification after model tuning 0.8566308243727598


In [20]:
from sklearn.svm import SVC

svcclassifier = SVC(kernel="linear", C=10,random_state=10)

svcclassifier = svcclassifier.fit(train_data, train_label)

score = svcclassifier.score(test_data, test_label)

print('SVM Classifier : ',score)

SVM Classifier :  0.985663082437276
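
The `score` method of a scikit-learn classifier reports mean accuracy, which can look good on imbalanced data even when little spam is caught. As a hedged illustration, the sketch below computes accuracy together with precision and recall for the spam class. The labels are made up and `spam_metrics` is a hypothetical helper; none of the numbers come from the models above.

```python
def spam_metrics(true_labels, predicted_labels, positive="spam"):
    """Return accuracy, plus precision and recall for the positive class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}

# Made-up labels: 8 ham and 2 spam, with a model that always answers "ham"
true_labels = ["ham"] * 8 + ["spam"] * 2
always_ham = ["ham"] * 10
metrics = spam_metrics(true_labels, always_ham)
print(metrics)
# {'accuracy': 0.8, 'precision': 0.0, 'recall': 0.0}
```

The always-ham model reaches 80% accuracy here while catching no spam at all, which is why per-class metrics are a useful complement to `score`.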
