## Building a Spam Detector - I
You’ve learnt all the basic preprocessing steps required for most text analytics applications. In this section, you will learn how to apply these steps to build a spam detector.

 

So far, you have learnt how to use the scikit-learn library to train machine learning algorithms. Here, Krishna will demonstrate how to build a spam detector using the NLTK library, which, as you might have already realised, is your go-to tool when you’re working with text.

 

It is not necessary for you to learn how to use NLTK’s machine learning functions, but it’s always useful to know more than one tool. More importantly, he’ll demonstrate how to extract features from raw text without using the scikit-learn package. So take this demonstration as a bonus: you'll learn how to preprocess text and build a classifier using NLTK. Before getting started, download the Jupyter notebook provided below to follow along:

### SPAM Ham Detection

In [1]:
import random
import nltk
import pandas as pd
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer

In [2]:
## Reading the given dataset
spam = pd.read_csv("Data/SMSSpamCollection.txt", sep = "\t", names=["label", "message"])

In [3]:
print(spam.head())

  label                                            message
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf, he lives aro...


In [4]:
print(spam['message'][2])

Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's


In [5]:
## Converting the dataset into a list of tuples, each tuple (row) containing the message and its label
data_set = []
for index,row in spam.iterrows():
    data_set.append((row['message'], row['label']))

In [6]:
print(data_set[:5])

[('Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...', 'ham'), ('Ok lar... Joking wif u oni...', 'ham'), ("Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's", 'spam'), ('U dun say so early hor... U c already then say...', 'ham'), ("Nah I don't think he goes to usf, he lives around here though", 'ham')]


In [7]:
print(len(data_set))

5572


### Preprocessing

In [8]:
## initialise the inbuilt Stemmer and the Lemmatizer
stemmer = PorterStemmer()
wordnet_lemmatizer = WordNetLemmatizer()

In [9]:
def preprocess(document, stem=True):
    'changes document to lower case, removes stopwords and lemmatizes/stems the remainder of the sentence'

    # change sentence to lower case
    document = document.lower()

    # tokenize into words
    words = word_tokenize(document)

    # remove stop words (build the set once instead of reloading the list for every token)
    stop_words = set(stopwords.words("english"))
    words = [word for word in words if word not in stop_words]

    if stem:
        words = [stemmer.stem(word) for word in words]
    else:
        words = [wordnet_lemmatizer.lemmatize(word, pos='v') for word in words]

    # join words to make sentence
    document = " ".join(words)

    return document

In [10]:
## - Performing the preprocessing steps on all messages
messages_set = []
for (message, label) in data_set:
    words_filtered = [e.lower() for e in preprocess(message, stem=False).split() if len(e) >= 3]
    messages_set.append((words_filtered, label))

In [11]:
print(messages_set[0])

(['jurong', 'point', 'crazy..', 'available', 'bugis', 'great', 'world', 'buffet', '...', 'cine', 'get', 'amore', 'wat', '...'], 'ham')


In [12]:
print(messages_set[:5])

[(['jurong', 'point', 'crazy..', 'available', 'bugis', 'great', 'world', 'buffet', '...', 'cine', 'get', 'amore', 'wat', '...'], 'ham'), (['lar', '...', 'joke', 'wif', 'oni', '...'], 'ham'), (['free', 'entry', 'wkly', 'comp', 'win', 'cup', 'final', 'tkts', '21st', 'may', '2005.', 'text', '87121', 'receive', 'entry', 'question', 'std', 'txt', 'rate', 'apply', '08452810075over18'], 'spam'), (['dun', 'say', 'early', 'hor', '...', 'already', 'say', '...'], 'ham'), (['nah', "n't", 'think', 'usf', 'live', 'around', 'though'], 'ham')]


Tokens shorter than three characters are removed to eliminate punctuation fragments such as double exclamation marks or double dots (the period character). You lose very little information by doing this, because most English words of one or two characters are stopwords (such as ‘am’, ‘is’, etc.) that would be removed anyway.
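
To make the effect of the length filter concrete, here is a minimal sketch that applies the same `len(e) >= 3` test to a hand-written token list (the tokens are illustrative, not the tokenizer's actual output). Note that three-character punctuation runs such as `'...'` still pass the filter, which is why they remain in the preprocessed messages printed above.

```python
# Illustrative tokens, similar to what a tokenizer might produce for an SMS
tokens = ["ok", "lar", "..", "!!", "...", "joke", "u", "wif"]

# Same rule as the notebook: keep tokens with three or more characters
kept = [t for t in tokens if len(t) >= 3]
dropped = [t for t in tokens if len(t) < 3]

print(kept)     # ['lar', '...', 'joke', 'wif']
print(dropped)  # ['ok', '..', '!!', 'u']
```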

 

You’ve already learnt how to create a bag-of-words model by using scikit-learn’s CountVectorizer function. However, Krishna will demonstrate how to build a bag-of-words model without using that function, that is, by building the model manually. The first step towards achieving that goal is to create a vocabulary from the text corpus that you have. In the following video, you’re going to learn how to create a vocabulary from the dataset.

### Preparing to create features

In [13]:
## - creating a single list of all words in the entire dataset for feature list creation

def get_words_in_messages(messages):
    all_words = []
    for (message, label) in messages:
      all_words.extend(message)
    return all_words

In [14]:
## - creating a final feature list using an intuitive FreqDist, to eliminate all the duplicate words
## Note : we can use the Frequency Distribution of the entire dataset to calculate Tf-Idf scores like we did earlier.

def get_word_features(wordlist):
    #print(wordlist[:10])
    wordlist = nltk.FreqDist(wordlist)
    word_features = wordlist.keys()
    return word_features

In [15]:
## - creating the word features for the entire dataset
word_features = get_word_features(get_words_in_messages(messages_set))
print(len(word_features))

8393
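
The same vocabulary-building idea can be sketched with only the standard library. `nltk.FreqDist` is a subclass of `collections.Counter`, so the sketch below uses `Counter` directly on a toy corpus. The `min_count` cut-off is a hypothetical addition (the notebook keeps every word) that shows how the frequencies could be used to shrink the feature list.

```python
from collections import Counter

# Toy stand-in for messages_set: (token list, label) pairs
toy_messages = [
    (["free", "entry", "win", "free"], "spam"),
    (["say", "early", "already", "say"], "ham"),
    (["win", "prize", "free"], "spam"),
]

# Flatten every token into one list, as get_words_in_messages does
all_words = [word for tokens, _label in toy_messages for word in tokens]

# Count occurrences; the keys of the counter are the unique vocabulary
freq = Counter(all_words)
vocabulary = list(freq)

# Hypothetical refinement: keep only words seen at least min_count times
min_count = 2
frequent_words = [word for word, count in freq.items() if count >= min_count]

print(len(vocabulary))   # 7
print(frequent_words)    # ['free', 'win', 'say']
```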


In [16]:
word_features

dict_keys(['jurong', 'point', 'crazy..', 'available', 'bugis', 'great', 'world', 'buffet', '...', 'cine', 'get', 'amore', 'wat', 'lar', 'joke', 'wif', 'oni', 'free', 'entry', 'wkly', 'comp', 'win', 'cup', 'final', 'tkts', '21st', 'may', '2005.', 'text', '87121', 'receive', 'question', 'std', 'txt', 'rate', 'apply', '08452810075over18', 'dun', 'say', 'early', 'hor', 'already', 'nah', "n't", 'think', 'usf', 'live', 'around', 'though', 'freemsg', 'hey', 'darling', 'week', 'word', 'back', 'like', 'fun', 'still', 'xxx', 'chgs', 'send', '£1.50', 'rcv', 'even', 'brother', 'speak', 'treat', 'aid', 'patent', 'per', 'request', "'melle", 'melle', 'oru', 'minnaminunginte', 'nurungu', 'vettam', 'set', 'callertune', 'callers', 'press', 'copy', 'friends', 'winner', 'value', 'network', 'customer', 'select', 'receivea', '£900', 'prize', 'reward', 'claim', 'call', '09061701461.', 'code', 'kl341', 'valid', 'hours', 'mobile', 'months', 'entitle', 'update', 'latest', 'colour', 'mobiles', 'camera', '0800298

In [17]:
print(type(word_features))

<class 'dict_keys'>


In [79]:
### Preparing to create a train and test set

In [18]:
## - creating slicing index at 80% threshold
sliceIndex = int((len(messages_set)*.8))
print(sliceIndex)

4457


In [19]:
## - shuffle the pack to create a random and unbiased split of the dataset
random.shuffle(messages_set)

In [22]:
train_messages, test_messages = messages_set[:sliceIndex], messages_set[sliceIndex:]

In [23]:
print(len(train_messages))
print(len(test_messages))

4457
1115
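
Because `random.shuffle` is called without a seed, each run of the notebook can produce a different split and therefore a slightly different accuracy. A minimal sketch of a reproducible split, assuming a fixed seed is acceptable for your experiment (the seed value `42` and the helper name `split_data` are arbitrary choices, not part of the notebook):

```python
import random

def split_data(items, train_fraction=0.8, seed=42):
    """Shuffle a copy of items with a fixed seed and split it into train/test."""
    shuffled = list(items)                 # copy, so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # private generator; global state unaffected
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

toy = [(["token%d" % i], "ham") for i in range(10)]
train, test = split_data(toy)
print(len(train), len(test))  # 8 2
```

Calling `split_data` twice with the same seed returns the same split, which makes accuracy figures comparable between runs.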


You learnt how to create vocabulary manually using all the words in the text corpus. In the next section, you’ll look at how to create a bag-of-words model.

Why do you think that Naive Bayes is a good choice when it comes to text classification problems such as spam detection?

Naive Bayes assumes that features are independent of one another given the class. In text classification tasks such as spam detection, the presence of certain words often matters far more than their order; it matters little whether they occur before or after other words. Hence, Naive Bayes often performs very well on these problems.
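
To see how the independence assumption turns into arithmetic, the sketch below scores a message with a tiny hand-rolled Naive Bayes over word-presence features. Everything in it is an illustrative assumption: the four training messages, the add-one smoothing, and the function names are made up for this example and are not NLTK's implementation.

```python
import math

# Toy training data: (set of words present, label)
train = [
    ({"free", "win", "prize"}, "spam"),
    ({"free", "call", "now"}, "spam"),
    ({"see", "you", "later"}, "ham"),
    ({"call", "you", "later"}, "ham"),
]
vocab = sorted(set().union(*(words for words, _ in train)))
labels = sorted({label for _, label in train})

def log_posterior(words, label):
    """Unnormalised log P(label) plus the sum over vocab of log P(feature | label)."""
    docs = [w for w, l in train if l == label]
    score = math.log(len(docs) / len(train))         # class prior
    for word in vocab:
        present = sum(word in d for d in docs)
        p_present = (present + 1) / (len(docs) + 2)  # add-one smoothing (assumed)
        # Each word contributes independently, whatever its position in the text
        score += math.log(p_present if word in words else 1 - p_present)
    return score

def classify(words):
    return max(labels, key=lambda label: log_posterior(words, label))

print(classify({"win", "free", "now"}))  # spam
print(classify({"see", "you"}))          # ham
```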

### Preparing to create feature maps for train and test data

In [24]:
## builds a dict recording, for each of the 8K+ vocabulary words, whether it is present in the given SMS message
## (apply_features below wraps this function in a LazyMap so the dicts are computed on demand)
def extract_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains(%s)' % word] = (word in document_words)
    return features
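
As a quick illustration of what `extract_features` returns, the sketch below runs the same logic against a three-word toy vocabulary (the vocabulary and message are invented for the example; the notebook uses the full `word_features` collection of roughly 8,000 words).

```python
# Toy vocabulary standing in for the notebook's word_features
toy_word_features = ["free", "win", "later"]

def extract_toy_features(document, vocabulary=toy_word_features):
    """Map a token list to {'contains(word)': bool} for every vocabulary word."""
    document_words = set(document)   # set membership makes each lookup cheap
    return {"contains(%s)" % word: (word in document_words) for word in vocabulary}

features = extract_toy_features(["win", "free", "phone", "free"])
print(features)
# {'contains(free)': True, 'contains(win)': True, 'contains(later)': False}
```

Words outside the vocabulary (here `'phone'`) are ignored, and repeating a word (`'free'` twice) does not change the result, because only presence is recorded.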

In [25]:
train_messages[1]

(['4mths',
  'half',
  'price',
  'orange',
  'line',
  'rental',
  'latest',
  'camera',
  'phone',
  'free',
  'phone',
  '11mths',
  'call',
  'mobilesdirect',
  'free',
  '08000938767',
  'update',
  'or2stoptxt'],
 'spam')

In [26]:
## - creating the feature map of train and test data

training_set = nltk.classify.apply_features(extract_features, train_messages)
testing_set = nltk.classify.apply_features(extract_features, test_messages)

In [100]:
training_set[0]

({'contains(jurong)': False,
  'contains(point)': False,
  'contains(crazy..)': False,
  'contains(available)': False,
  'contains(bugis)': False,
  'contains(great)': False,
  'contains(world)': False,
  'contains(buffet)': False,
  'contains(...)': False,
  'contains(cine)': False,
  'contains(get)': False,
  'contains(amore)': False,
  'contains(wat)': False,
  'contains(lar)': False,
  'contains(joke)': False,
  'contains(wif)': False,
  'contains(oni)': False,
  'contains(free)': False,
  'contains(entry)': False,
  'contains(wkly)': False,
  'contains(comp)': False,
  'contains(win)': False,
  'contains(cup)': False,
  'contains(final)': False,
  'contains(tkts)': False,
  'contains(21st)': False,
  'contains(may)': False,
  'contains(2005.)': False,
  'contains(text)': False,
  'contains(87121)': False,
  'contains(receive)': False,
  'contains(question)': False,
  'contains(std)': False,
  'contains(txt)': False,
  'contains(rate)': False,
  'contains(apply)': False,
  'contain

In [98]:
#print(training_set[0])

In [27]:
print('Training set size : ', len(training_set))
print('Test set size : ', len(testing_set))

Training set size :  4457
Test set size :  1115


### Training

In [28]:
## Training the classifier with NaiveBayes algorithm
spamClassifier = nltk.NaiveBayesClassifier.train(training_set)

### Evaluation

In [30]:
## - Analyzing the accuracy on the training set
#print(nltk.classify.accuracy(spamClassifier, training_set))

In [29]:
## Analyzing the accuracy of the test set
print(nltk.classify.accuracy(spamClassifier, testing_set))

0.979372197309417
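
`nltk.classify.accuracy` reports the fraction of test items whose predicted label matches the gold label. A minimal stand-alone sketch of that calculation follows, together with a per-label breakdown that can be useful when one class is much more frequent than the other. The gold and predicted labels are invented for illustration; they are not the notebook's results.

```python
from collections import Counter

# Hypothetical gold labels and classifier predictions
gold      = ["ham", "ham", "ham", "ham", "spam", "spam", "ham", "spam"]
predicted = ["ham", "ham", "ham", "ham", "spam", "ham",  "ham", "spam"]

# Overall accuracy: share of positions where the prediction equals the gold label
correct = sum(g == p for g, p in zip(gold, predicted))
accuracy = correct / len(gold)

# Per-label accuracy: correct predictions divided by the gold count of that label
gold_counts = Counter(gold)
hits = Counter(g for g, p in zip(gold, predicted) if g == p)
per_label = {label: hits[label] / gold_counts[label] for label in gold_counts}

print(accuracy)    # 0.875
print(per_label)   # {'ham': 1.0, 'spam': 0.6666666666666666}
```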


In [31]:
## Testing an example message with our newly trained classifier
m = 'CONGRATULATIONS!! As a valued account holder you have been selected to receive a £900 prize reward! Valid 12 hours only.'
print('Classification result : ', spamClassifier.classify(extract_features(m.split())))

Classification result :  spam
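
One caveat with the cell above: `m.split()` yields raw tokens such as `'CONGRATULATIONS!!'` and `'Valid'`, whereas the vocabulary was built from lower-cased, preprocessed tokens, so some words may fail to match any feature. The sketch below shows the effect on a toy vocabulary, using a simplified lower-case-and-strip-punctuation step as a stand-in for the notebook's `preprocess` function (the helper names `simple_normalise` and `matched` are hypothetical).

```python
import string

toy_vocabulary = ["congratulations", "prize", "valid", "hours"]

def simple_normalise(message):
    """Lower-case each token and strip leading/trailing ASCII punctuation."""
    tokens = (token.strip(string.punctuation).lower() for token in message.split())
    return [token for token in tokens if token]

def matched(tokens):
    """Vocabulary words that appear among the tokens."""
    present = set(tokens)
    return [word for word in toy_vocabulary if word in present]

m = "CONGRATULATIONS!! You receive a prize reward! Valid 12 hours only."
print(matched(m.split()))            # ['prize', 'hours']
print(matched(simple_normalise(m)))  # ['congratulations', 'prize', 'valid', 'hours']
```

In the notebook itself, passing `preprocess(m, stem=False).split()` to `extract_features` would apply the same steps that were used on the training messages.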


In [32]:
n = '''Exams are just a way to test your child's level of understanding and ability to answer questions in a timed environment.

Cuemath's Workbooks for Grades KG-8 are designed by curriculum experts from IIT & Cambridge University to

Strengthen math concepts
Handle all Question formats
Prepare for exams
This exam season might be over but you can help your child be ready for the next. Use the code EXAMS20 to get a 20% OFF + 30% OFF on math workbooks from India's #1 math program'''


In [33]:
p = '''Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's'''

In [34]:
## Testing an example message with our newly trained classifier
m = 'CONGRATULATIONS...! receive I charge only $1000 dollars prize but I am free free free free for you darling,Valid 12 hours only '
print('Classification result : ', spamClassifier.classify(extract_features(m.split())))

Classification result :  spam


In [38]:
## Printing the most informative features in the classifier
## (show_most_informative_features prints the table itself and returns None, so no print() wrapper is needed)
spamClassifier.show_most_informative_features(10)

Most Informative Features
        contains(urgent) = True             spam : ham    =    189.7 : 1.0
       contains(service) = True             spam : ham    =    103.2 : 1.0
         contains(nokia) = True             spam : ham    =    101.0 : 1.0
          contains(code) = True             spam : ham    =     95.9 : 1.0
         contains(await) = True             spam : ham    =     91.7 : 1.0
       contains(attempt) = True             spam : ham    =     78.9 : 1.0
          contains(rate) = True             spam : ham    =     78.9 : 1.0
           contains(txt) = True             spam : ham    =     77.7 : 1.0
          contains(draw) = True             spam : ham    =     64.9 : 1.0
      contains(landline) = True             spam : ham    =     62.7 : 1.0


We’ve got an accuracy of about 98% on the test set. Although this is an excellent result, you could try to improve it further by experimenting with other models.

 

Note that Krishna has built the bag-of-words representation from scratch, without using scikit-learn’s CountVectorizer() function. He has used a binary representation instead of using the number of occurrences of each word. In this bag-of-words table, each feature simply records whether the word is present in the document (True, or ‘1’) or absent (False, or ‘0’). You can get the same kind of representation from CountVectorizer() by setting its ‘binary’ parameter to ‘True’.
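
The difference between a count-based and a binary bag-of-words row can be sketched in plain Python (toy vocabulary and document, invented for illustration). The binary row is what the notebook's `contains(...)` features encode, with `True`/`False` in place of 1/0.

```python
from collections import Counter

vocabulary = ["call", "free", "phone", "prize"]
document = ["free", "phone", "free", "call", "free"]

counts = Counter(document)

# Count representation: how many times each vocabulary word occurs
count_row = [counts[word] for word in vocabulary]

# Binary representation: 1 if the word occurs at all, else 0
binary_row = [int(counts[word] > 0) for word in vocabulary]

print(count_row)   # [1, 3, 1, 0]
print(binary_row)  # [1, 1, 1, 0]
```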

 

You also saw that Krishna used the pickle library to save the model. Once a model has been trained, it can be serialised to disk using pickle. This way, you can reload it later without retraining, or send it to be used on a different computer, provided a compatible Python and NLTK installation is available there.

 
  

The steps that you just saw should convince you that to get excellent results, you need to take extra care of the nuances of the dataset you’re working on. You need to understand the data inside-out to take these steps because these can’t be generalised to every text classifier or even other spam datasets.

 

In [62]:
## storing the classifier on disk for later usage
import pickle
with open('nb_spam_classifier.pickle', 'wb') as f:   # the file is closed automatically
    pickle.dump(spamClassifier, f)
    print('Classifier stored at ', f.name)

Classifier stored at  nb_spam_classifier.pickle
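
Loading the model back is the mirror image of saving it. The sketch below round-trips a small stand-in object through a temporary file so that it runs without the trained classifier; with the real model you would open `nb_spam_classifier.pickle` in `'rb'` mode instead. Only unpickle files you trust, since `pickle.load` can execute arbitrary code.

```python
import os
import pickle
import tempfile

# Stand-in for spamClassifier so the sketch is self-contained
model = {"name": "toy-classifier", "labels": ["ham", "spam"]}

path = os.path.join(tempfile.mkdtemp(), "toy_classifier.pickle")

# Save: the context manager closes the file even if dump raises
with open(path, "wb") as f:
    pickle.dump(model, f)

# Load: read the bytes back into an equivalent Python object
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == model)  # True
```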
