P05 D Panchal

---
- Python, using Pandas, NumPy and Re 
---
# Create a spam filter for messages using the Naive Bayes algorithm

Using the Naive Bayes algorithm and an *existing data set* of messages, we are going to create a spam filter that classifies new messages.

As expected, this requires 'teaching' the computer with messages that are already *labelled*. We have a data set of `5,572` SMS messages that have been classified by humans.

We will use `80%` of this data set for training the algorithm and the remaining `20%` for testing. The algorithm we will use is Multinomial Naive Bayes.





In [1]:
import pandas as pd

sms_data = pd.read_csv('SMSSpamCollection', sep='\t', header=None, names=['Label', 'SMS'])

print(sms_data.shape)
sms_data.head()

(5572, 2)


| | Label | SMS |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


### Proportion of spam
The data set classified by humans has a total of 5,572 messages. At this stage it is important to know what proportion of them is *spam*.

The output below shows that **`13.4%`** are spam and `86.6%` are non-spam (ham).

In [2]:
sms_data['Label'].value_counts(normalize=True)


ham     0.865937
spam    0.134063
Name: Label, dtype: float64

### Splitting into Training Set and Test Set
We will first *randomize* the data set and then split it into a training set and a test set. The test set will be 20% of the size of the original data set.
The first 4,458 rows go into the training set and the remaining 1,114 rows into the test set.

In [3]:
sms_randomized = sms_data.sample(frac=1, random_state=1)

#find the slice point for 80% of data set
training_set_last_row_num = round(len(sms_randomized)*0.8)
print(training_set_last_row_num)

training_set = sms_randomized[:training_set_last_row_num].reset_index(drop=True)
test_set = sms_randomized[training_set_last_row_num:].reset_index(drop=True)

print(training_set.shape)
print(test_set.shape)

4458
(4458, 2)
(1114, 2)


### Proportions check in training and test sets
Once again, let us check the percentage of spam in the newly created training set and test set. We expect it to be similar to that of the original set.
It comes out to 13.5% in the training set and 13.2% in the test set, which **is very close** to the 13.4% of the full data set, as expected.

In [4]:
training_set['Label'].value_counts(normalize=True)

ham     0.86541
spam    0.13459
Name: Label, dtype: float64

In [5]:
test_set['Label'].value_counts(normalize=True)

ham     0.868043
spam    0.131957
Name: Label, dtype: float64

### Cleaning the data

We are going to replace every *symbol* (any non-word character) in the messages with a space and make all text *lowercase*.

In [6]:
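# regex=True is required in pandas 2.0+, where str.replace matches literally by default
training_set['SMS'] = training_set['SMS'].str.replace(r'\W', ' ', regex=True)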
training_set['SMS'] = training_set['SMS'].str.lower()
training_set.head()

| | Label | SMS |
|---|---|---|
| 0 | ham | yep by the pretty sculpture |
| 1 | ham | yes princess are you going to make me moan |
| 2 | ham | welp apparently he retired |
| 3 | ham | havent |
| 4 | ham | i forgot 2 ask ü all smth there s a card on ... |
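
To make the cleaning rule concrete, here is a minimal sketch on a single made-up message. The helper name `clean_sms` and the sample text are assumptions for illustration and are not part of the data set.

```python
import re

def clean_sms(text):
    """Replace every non-word character with a space, then lowercase."""
    return re.sub(r'\W', ' ', text).lower()

# Hypothetical message, not taken from the SMS data set.
sample = "WINNER!! Call 0800-123 now, it's FREE."
cleaned = clean_sms(sample)

print(cleaned)
print(cleaned.split())
```

Note that digits and accented letters such as `ü` count as word characters and survive, while an apostrophe splits `it's` into `it` and `s`. The same effect is visible in row 4 of the table above ("there s a card").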


### List all unique words
We need to read the spam and non-spam messages and build a list of the **unique vocabulary**.

The result is a vocabulary of **`7783`** unique words that covers the entire training data set.

In [7]:
training_set['SMS'] = training_set['SMS'].str.split()

vocabulary = []
for sms in training_set['SMS']:
    for word in sms:
        vocabulary.append(word)

vocabulary = list(set(vocabulary))


In [8]:
len(vocabulary)

7783

### Create a table for vocabulary
Once we have the vocabulary, we need to count how many times each word is used in each message. This makes a large table: one row per message and 7783 columns, one per word.

In [9]:
word_counts_per_sms = {unique_word: [0] * len(training_set['SMS']) for unique_word in vocabulary}

for index, sms in enumerate(training_set['SMS']):
    for word in sms:
        word_counts_per_sms[word][index] += 1


In [10]:
word_counts = pd.DataFrame(word_counts_per_sms)
word_counts.head()

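| | 0 | 00 | 000 | 000pes | 008704050406 | 0089 | 01223585334 | 02 | 0207 | 02072069400 | ... | zindgi | zoe | zogtorius | zouk | zyada | é | ú1 | ü | 〨ud | 鈥 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 |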
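
The shape of this table is easier to see on a tiny example. The sketch below builds the same kind of word-count table for three made-up, already tokenized messages. It uses `collections.Counter` rather than the dictionary-of-lists approach above, so it is an alternative illustration and not the notebook's own method. All `toy_` names are assumptions.

```python
from collections import Counter
import pandas as pd

# Hypothetical tokenized messages.
toy_messages = [
    ['free', 'prize', 'call', 'now'],
    ['call', 'me', 'now', 'now'],
    ['see', 'you', 'there'],
]

# Sorted so the column order is reproducible (a plain set has no fixed order).
toy_vocabulary = sorted({word for sms in toy_messages for word in sms})

# One dict of counts per message; words absent from a message come out as NaN,
# so they are filled with 0 before converting back to integers.
toy_counts = (
    pd.DataFrame([dict(Counter(sms)) for sms in toy_messages], columns=toy_vocabulary)
    .fillna(0)
    .astype(int)
)
print(toy_counts)
```

Each row is one message and each column one vocabulary word, exactly as in the 7783-column table, only small enough to check by eye.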


### Join the two tables
With the pandas `concat` function, we merge the training data table with the massive table of word counts.

In [11]:
training_set_clean = pd.concat([training_set, word_counts], axis=1)
training_set_clean.head()

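| | Label | SMS | 0 | 00 | 000 | 000pes | 008704050406 | 0089 | 01223585334 | 02 | ... | zindgi | zoe | zogtorius | zouk | zyada | é | ú1 | ü | 〨ud | 鈥 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ham | [yep, by, the, pretty, sculpture] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | ham | [yes, princess, are, you, going, to, make, me,... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | ham | [welp, apparently, he, retired] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | ham | [havent] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | ham | [i, forgot, 2, ask, ü, all, smth, there, s, a,... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 |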
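
One detail worth spelling out: `pd.concat(..., axis=1)` pairs rows by index label, not by position. The join above works because `training_set` was given a fresh 0..n-1 index with `reset_index(drop=True)` in In [3]. The sketch below, on made-up frames, shows what can go wrong when a shuffled frame keeps its old index.

```python
import pandas as pd

# Hypothetical shuffled labels that still carry their pre-shuffle index.
shuffled = pd.DataFrame({'Label': ['ham', 'spam', 'ham']}, index=[2, 0, 1])
# Hypothetical counts built row by row, so they get the default index 0, 1, 2.
counts = pd.DataFrame({'free': [0, 1, 0]})

# Index order differs: pandas matches rows by label, so counts land on the wrong rows.
misaligned = pd.concat([shuffled, counts], axis=1)

# After resetting, both frames share 0..n-1 and rows pair up by position.
aligned = pd.concat([shuffled.reset_index(drop=True), counts], axis=1)

print(misaligned)
print(aligned)
```

In the misaligned result the spam message ends up with a count of 0 for `free`, with no error or warning, which is why resetting the index before the join matters.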


### Calculating the constant values

For the algorithm, there are **six values** that have to be fixed ahead of time: the priors `p_spam` and `p_ham`, the word totals `n_spam` and `n_ham`, and the vocabulary size `n_vocabulary` (all taken from the training data set), plus the Laplace smoothing value `alpha`, which we set to 1. These remain constant and don't need to be recalculated when filtering a new message.

A benefit of the Naive Bayes algorithm is that we can run this long calculation once and keep the resulting *constants* ready ahead of the test data. The computer then does not have to recalculate thousands of columns each time a new message arrives. Low computational cost, and hence speed, is one of the advantages of this algorithm.


In [12]:
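# p_ham & p_spam, looked up by label so the result does not depend on row order
label_shares = training_set_clean['Label'].value_counts(normalize=True)
p_ham, p_spam = label_shares['ham'], label_shares['spam']
print(p_spam)
print(p_ham)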

#Breaking training set into spam and ham set
spam_messages = training_set_clean[training_set_clean['Label'] == 'spam']
ham_messages = training_set_clean[training_set_clean['Label'] == 'ham']

# N spam
num_of_words_per_spam_message = spam_messages['SMS'].apply(len)
n_spam = num_of_words_per_spam_message.sum()

# N ham
num_of_words_per_ham_message = ham_messages['SMS'].apply(len)
n_ham = num_of_words_per_ham_message.sum()

# Number of unique words
n_vocabulary = len(vocabulary)

# Laplace smoothing value
alpha = 1




0.13458950201884254
0.8654104979811574


### Calculating the Parameters
In the same way, we calculate a value for every unique word in the vocabulary and keep it ready. These values are needed whenever one of these words appears in a message being classified.

These values are called *parameters*.
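
Written out, the parameter computed for each word $w$ is the Laplace-smoothed relative frequency of that word within one class. The symbols below map onto the code variables `n_word_given_spam`, `n_spam`, `n_vocabulary` and `alpha`:

$$
P(w \mid Spam) = \frac{N_{w \mid Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}}
\qquad
P(w \mid Ham) = \frac{N_{w \mid Ham} + \alpha}{N_{Ham} + \alpha \cdot N_{Vocabulary}}
$$

With $\alpha = 1$, a word that never occurs in one class still gets a small non-zero probability, so a single unseen word cannot force the whole product for that class to zero.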

In [13]:
parameters_spam = {unique_word:0 for unique_word in vocabulary}
parameters_ham = {unique_word:0 for unique_word in vocabulary}

for word in vocabulary:
    # num of times in spam
    n_word_given_spam = spam_messages[word].sum()
    # probability of a word given it is spam
    p_word_given_spam = (n_word_given_spam + alpha)/(n_spam +(alpha*n_vocabulary))
    parameters_spam[word] = p_word_given_spam
    
    # num of times in ham
    n_word_given_ham = ham_messages[word].sum()
    # probability of a word given it is ham
    p_word_given_ham = (n_word_given_ham + alpha)/(n_ham +(alpha*n_vocabulary))
    parameters_ham[word] = p_word_given_ham
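
The loop above sums one column at a time. As a possible alternative, the same smoothed values can be computed for all words at once with a column-wise sum. The sketch below shows this on a made-up three-word table. The helper `smoothed_parameters` and all `toy_` names are assumptions for illustration, not code from this notebook.

```python
import pandas as pd

# Hypothetical word-count table: one spam message and two ham messages.
toy = pd.DataFrame({
    'Label': ['spam', 'ham', 'ham'],
    'free':  [2, 0, 0],
    'call':  [1, 1, 0],
    'see':   [0, 0, 1],
})
toy_vocab = ['free', 'call', 'see']
toy_alpha = 1

def smoothed_parameters(rows, vocab, alpha):
    """Return P(word | class) for every word in vocab, with Laplace smoothing."""
    word_totals = rows[vocab].sum()   # occurrences of each word in this class
    n_words = word_totals.sum()       # total number of words in this class
    return (word_totals + alpha) / (n_words + alpha * len(vocab))

toy_spam_params = smoothed_parameters(toy[toy['Label'] == 'spam'], toy_vocab, toy_alpha)
toy_ham_params = smoothed_parameters(toy[toy['Label'] == 'ham'], toy_vocab, toy_alpha)

print(toy_spam_params)
print(toy_ham_params)
```

Here the class word total is taken from the count columns instead of from message lengths. The two agree as long as every word in the messages is also a column, which holds for the training set. Within each class the smoothed parameters sum to 1, which is a quick sanity check.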
    

### Ready to classify a new message
Let us define a function that classifies a given text and prints whether it is `spam` or `ham`, based on which of the two probabilities is greater.
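
Under the naive assumption that the words $w_1, \dots, w_n$ of a message are independent given the class, the two quantities being compared can be written as:

$$
P(Spam \mid w_1, \dots, w_n) \propto P(Spam) \cdot \prod_{i=1}^{n} P(w_i \mid Spam)
$$

$$
P(Ham \mid w_1, \dots, w_n) \propto P(Ham) \cdot \prod_{i=1}^{n} P(w_i \mid Ham)
$$

The $\propto$ sign matters: the common normalizing factor is left out, so the two scores do not sum to 1 and can be very small numbers. Only their comparison is used. Words that are not in the vocabulary contribute no factor at all.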

In [14]:
# will be using regular expressions
import re

def classify(message):
    
    message = re.sub(r'\W', ' ', message)
    message = message.lower().split()
    
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    for word in message:
        if word in parameters_spam:
            p_spam_given_message *= parameters_spam[word]
            
        if word in parameters_ham:
            p_ham_given_message *= parameters_ham[word]
            
    print('P(Spam|message):', p_spam_given_message)
    print('P(Ham|message):', p_ham_given_message)
    
    if p_ham_given_message > p_spam_given_message:
        print('Label: Ham')
    elif p_ham_given_message < p_spam_given_message:
        print('Label: Spam')
    else:
        print('Equal probabilities, have a human classify this!')
    # ratio of p_spam to p_ham, to give idea of how prominent is difference
    print('ratio :', p_spam_given_message/p_ham_given_message)
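
Because every factor is well below 1, the product shrinks with each word, and for a long enough message both scores can underflow to `0.0`, which would look like a tie. A common remedy is to add logarithms instead of multiplying. The sketch below is one possible variant under that assumption. It is not the function used in this notebook, and the `toy_` values are hypothetical stand-ins for `p_spam`, `p_ham`, `parameters_spam` and `parameters_ham`.

```python
import math
import re

# Hypothetical stand-ins for the priors and the two parameter dictionaries.
toy_p_spam, toy_p_ham = 0.13, 0.87
toy_parameters_spam = {'free': 0.02, 'prize': 0.01, 'see': 0.001}
toy_parameters_ham = {'free': 0.002, 'prize': 0.0005, 'see': 0.01}

def classify_log(message):
    """Return 'spam', 'ham' or 'needs human classification' using log scores."""
    words = re.sub(r'\W', ' ', message).lower().split()

    log_spam = math.log(toy_p_spam)
    log_ham = math.log(toy_p_ham)

    for word in words:
        # Same rule as classify(): words outside the vocabulary are ignored.
        if word in toy_parameters_spam:
            log_spam += math.log(toy_parameters_spam[word])
        if word in toy_parameters_ham:
            log_ham += math.log(toy_parameters_ham[word])

    if log_spam > log_ham:
        return 'spam'
    if log_ham > log_spam:
        return 'ham'
    return 'needs human classification'

print(classify_log('FREE prize!!'))
print(classify_log('see you there'))
print(classify_log('free ' * 500))  # 500 factors would underflow if multiplied
```

Since the logarithm is increasing, comparing log scores gives the same label as comparing the products whenever the products are still representable. The 500-word message is an artificial stress case and not a measurement on this data set.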


### Let's begin
Now that the constants and parameters generated from the training data are **ready**, and the code for classifying is also ready, let us classify this *new* message:

    'WINNER!! This is the secret code to unlock the money: C3421.'

The result shows that the *probability of spam* is **greater** than the *probability of non-spam*.
For reference, the spam score is about 70 times higher than the non-spam score for this message.

We will try another one:

    'Sounds good, Tom, then see u there'

The result shows that the *probability of spam* is **less** than the *probability of non-spam*.


In [15]:
classify('WINNER!! This is the secret code to unlock the money: C3421.')

P(Spam|message): 1.3481290211300841e-25
P(Ham|message): 1.9368049028589875e-27
Label: Spam
ratio : 69.60582447618044


In [16]:
classify('Sounds good, Tom, then see u there')

P(Spam|message): 2.4372375665888117e-25
P(Ham|message): 3.687530435009238e-21
Label: Ham
ratio : 6.609403256580054e-05


**Indeed, we got the correct** results from the algorithm. For the first message we got `Label: Spam`, and for the next message we got `Label: Ham`.
This is a good result. Next we will run this program on the whole test set of `1114` messages and see how many of them are predicted the same as the human classification. If all of them are the same, accuracy will be 100%.

### Measure accuracy
Now it is time to apply this classification to more than one message, namely the whole *test data set*. This lets us measure how many predictions match the human classification.

We will modify the above function so that it returns a label rather than printing it.

In [17]:
def classify_test_set(message):
    # Same cleaning as for the training set: symbols to spaces, lowercase, split into words
    message = re.sub(r'\W', ' ', message)
    message = message.lower().split()
    
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    for word in message:
        if word in parameters_spam:
            p_spam_given_message *= parameters_spam[word]
            
        if word in parameters_ham:
            p_ham_given_message *= parameters_ham[word]
    
    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_spam_given_message > p_ham_given_message:
        return 'spam'
    else:
        return 'needs human classification'

In [18]:
test_set['predicted'] = test_set['SMS'].apply(classify_test_set)
test_set.head()

| | Label | SMS | predicted |
|---|---|---|---|
| 0 | ham | Later i guess. I needa do mcat study too. | ham |
| 1 | ham | But i haf enuff space got like 4 mb... | ham |
| 2 | spam | Had your mobile 10 mths? Update to latest Oran... | spam |
| 3 | ham | All sounds good. Fingers . Makes it difficult ... | ham |
| 4 | ham | All done, all handed in. Don't know if mega sh... | ham |


### Compare result of algorithm with human classification

There are ***only 14*** messages that were **not** correctly classified by the algorithm. The other **1100 messages were classified by the algorithm exactly as the humans classified them**.

This works out to the accuracy shown below.

In [19]:
correct = 0
total = test_set.shape[0]
    
for row in test_set.iterrows():
    row = row[1]
    if row['Label'] == row['predicted']:
        correct += 1
        
print('Correct:', correct)
print('Incorrect:', total - correct)
print('Accuracy :', 100*correct/total, ' %')



Correct: 1100
Incorrect: 14
Accuracy : 98.74326750448833  %
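
A single accuracy figure does not say which kind of mistake was made: spam that slipped through, or ham wrongly flagged. One way to see that, sketched here on made-up labels, is a cross-tabulation of human labels against predictions. The `toy_` names and values are assumptions. In the notebook the two columns would be `test_set['Label']` and `test_set['predicted']`.

```python
import pandas as pd

# Hypothetical human labels and predictions, not results from the test set.
toy_results = pd.DataFrame({
    'Label':     ['ham', 'ham', 'ham', 'spam', 'spam', 'ham'],
    'predicted': ['ham', 'ham', 'spam', 'spam', 'ham', 'ham'],
})

# Share of rows where the prediction equals the human label.
toy_accuracy = (toy_results['Label'] == toy_results['predicted']).mean()

# Rows are human labels, columns are predictions.
toy_confusion = pd.crosstab(toy_results['Label'], toy_results['predicted'])

print(toy_accuracy)
print(toy_confusion)
```

The off-diagonal cells of the table are the two error types. With classes as unbalanced as ham and spam are here, that split can be more informative than accuracy alone.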


# Outcome and Accuracy

**After running the classification on all 1,114 individual SMS messages in the test set, we got an accuracy of `98.74%`.**

**This is an excellent result for a computer program: always predicting `ham` would score only about `86.8%` on this test set, and labelling the messages by hand would take considerable human effort. The same filter can be applied to further new messages without manual effort.**