## Programming HW4

## Module submission header
### Submission preparation instructions 
_Completion of this header is mandatory, subject to a 2-point deduction to the assignment._ Only add plain text in the designated areas, i.e., replacing the relevant 'NA's. You must fill out all group member Names and Drexel email addresses in the markdown list below, under the header __Module submission group__. You are required to provide descriptive notes on any tutoring support received in the completion of this submission under the __Additional submission comments__ section at the bottom of the header. If no tutoring support was received, leave NA in place. You may also list other optional comments pertaining to the submission at the bottom. _Any disruption of this header's formatting will make your group liable to the 2-point deduction._

### Module submission group
- Group member 1
    - Name: Avennia Maragh
    - Email: Ahr62@drexel.edu
- Group member 2
    - Name: 
    - Email: 
- Group member 3
    - Name: 
    - Email: 
- Group member 4
    - Name: NA
    - Email: NA

### Additional submission comments
- Tutoring support received: NA
- Other (other): NA

## Spam Classifier (25 points)

Implement a Naive Bayes classifier `naiveBayes_classify(word_probs, message)` that classifies an email message as spam or non-spam using the word probability distributions, `word_probs`, learned from a set of training data.

In this question, you are asked to implement the Naive Bayes method from scratch by implementing the following functions. To simplify the implementation, we assume that any message is equally likely to be spam or not-spam.
* `tokenize(message)`: extracts the set of unique words from the given text message.
* `count_words(training_set)`: creates a dictionary mapping each unique word to its frequencies in spam and non-spam messages in the training set.
* `word_probabilities(counts, total_spams, total_non_spams, k=0.5)`: turns the word counts into a list of triples (w, p(w | spam), p(w | ~spam)).
* `spam_probability(word_probs, message, total_spams, total_non_spams, k = 0.5)`: computes the probability of spam for the given message.
* `naiveBayes_classify(word_probs, message, total_spams, total_non_spams, k)`: classifies the message as spam or ham.

Use the data set `spam.csv` to evaluate the classifier in terms of accuracy, recall, precision, and F1-score.

### Implement the following functions

In [1]:
from collections import Counter, defaultdict
import math,re

def tokenize(message):
    """
    extracts the set of unique words from the given text message
    INPUT:
        message: a piece of text
    OUTPUT:
        a set of unique words
    """
    message = message.lower()                       # convert to lowercase
    all_words = re.findall("[a-z0-9']+", message)   # extract the words
    return set(all_words)                           # remove duplicates
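
As a quick check of what `tokenize` returns, the sketch below runs it on the first training message shown later in this notebook. The function is repeated so the snippet runs on its own. Note that punctuation and emoticons act as separators, while an apostrophe is kept inside a word.

```python
import re

def tokenize(message):
    # Same logic as the cell above: lowercase, extract word-like runs, deduplicate.
    message = message.lower()
    return set(re.findall("[a-z0-9']+", message))

tokens = tokenize("No no:)this is kallis home ground.amla home town is durban:)")
print(sorted(tokens))

# Apostrophes are part of the character class, so "don't" stays one token.
contraction = tokenize("I don't")
print(sorted(contraction))
```

Because the result is a set, repeated words such as "home" and "is" appear only once.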

In [2]:
def count_words(training_set):
    """
    creates a dictionary containing the mappings from unique words to the frequencies of the words in 
    spam and non-spam messages in the training set
    INPUT:
        training_set: training set consists of pairs (message, is_spam)
    OUTPUT:
        a map from unique words to their frequencies in spam and non-spam messages
    """
    counts = defaultdict(lambda: [0, 0])
    for message, is_spam in training_set:
        for word in tokenize(message):
            counts[word][0 if is_spam else 1] += 1
    return counts
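
To make the counting convention concrete, here is a small sketch on a made-up training set (the three messages are hypothetical, not taken from `spam.csv`). Since `tokenize` returns a set, each count is the number of messages containing the word, not the number of occurrences.

```python
import re
from collections import defaultdict

def tokenize(message):
    return set(re.findall("[a-z0-9']+", message.lower()))

def count_words(training_set):
    # counts[word] = [number of spam messages, number of non-spam messages]
    counts = defaultdict(lambda: [0, 0])
    for message, is_spam in training_set:
        for word in tokenize(message):
            counts[word][0 if is_spam else 1] += 1
    return counts

# Hypothetical toy data: (message, is_spam)
toy_training_set = [("win cash now", 1), ("see you now", 0), ("win win win", 1)]
toy_counts = count_words(toy_training_set)
print(dict(toy_counts))
```

The word "win" is counted twice for spam (two spam messages contain it), even though it occurs four times in total.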

In [3]:
counts = defaultdict(lambda: [0, 0])

In [4]:
counts["wins"][0]=50
counts["wins"][1]=500
counts["wins"]

[50, 500]

In [5]:
def word_probabilities(counts, total_spams, total_non_spams, k=0.5):
    """
    turns the word counts into a list of triples (w, p(w | spam), p(w | ~spam))
    INPUT:
        counts: a map from unique words to their frequencies in spam and non-spam messages
        total_spams: the total number of spam messages
        total_non_spams: the total number of non-spam messages
        k=0.5: the smoothing parameter, default 0.5
    OUTPUT:
        a list of triples (w, p(w|spam), p(w|non-spam))
    """
    return [(w,
             (spam + k) / (total_spams + 2 * k),
             (non_spam + k) / (total_non_spams + 2 * k))
             for w, (spam, non_spam) in counts.items()]
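
The smoothing formula can be checked by hand. The sketch below applies it to the word 'no', using the counts `[52, 216]` and the totals 581 and 3876 that are reported further down in this notebook. The helper name `smoothed` is introduced here only for illustration.

```python
def smoothed(count, total, k=0.5):
    # Same expression as in word_probabilities: (count + k) / (total + 2k)
    return (count + k) / (total + 2 * k)

p_no_given_spam = smoothed(52, 581)    # 52.5 / 582
p_no_given_ham = smoothed(216, 3876)   # 216.5 / 3877
p_unseen_given_spam = smoothed(0, 581) # a word never seen in spam

print(p_no_given_spam, p_no_given_ham, p_unseen_given_spam)
```

With k > 0, a word that never appears in spam still gets a small nonzero probability, and a word that appears in every spam message gets a probability below 1, so the later calls to `math.log` never receive 0.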

In [6]:
def spam_probability(word_probs, message, total_spams, total_non_spams, k = 0.5):
    """
    computes the probability of spam for the given message
    INPUT:
        word_probs: a list of triples (w, p(w|spam), p(w|non-spam))
        message: a message under classification
    OUTPUT:
        the probability of being spam for the message
    HINTS:
        First, get the set of unique words in the message.
        Second, sum up all the log probabilities of the unique words in the message.
        Third, get probabilities by taking exponentials of the summed log probabilities (for spam and non-spam).
        Finally, return the ratio of the probability of spam over the sum of the probability of spam and the 
        probability of not spam.
    """
    ############YOUR CODE HERE##################
    message_w = tokenize(message)  # unique words in the message
    probable_spam = probable_not_spam = 0.0
    
    for word, prob_spam, prob_not_spam in word_probs:
        if word in message_w:
            probable_spam += math.log(prob_spam)
            probable_not_spam += math.log(prob_not_spam)
        else:
            probable_spam += math.log(1.0 - prob_spam)
            probable_not_spam += math.log(1.0 - prob_not_spam)
            
    prob_spam = math.exp(probable_spam)
    prob_ham = math.exp(probable_not_spam)
    
    return prob_spam / (prob_spam + prob_ham)
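
One possible refinement, offered as a sketch and not as part of the required solution: with a large vocabulary, both summed log probabilities can be so negative that `math.exp` returns 0.0 for each, and the final ratio becomes 0/0. Since P(spam | message) = 1 / (1 + exp(log_ham - log_spam)), the ratio can be computed from the difference of the two sums instead. The name `spam_probability_stable` is hypothetical.

```python
import math
import re

def tokenize(message):
    return set(re.findall("[a-z0-9']+", message.lower()))

def spam_probability_stable(word_probs, message):
    words = tokenize(message)
    log_spam = log_ham = 0.0
    for word, p_spam, p_ham in word_probs:
        if word in words:
            log_spam += math.log(p_spam)
            log_ham += math.log(p_ham)
        else:
            log_spam += math.log(1.0 - p_spam)
            log_ham += math.log(1.0 - p_ham)
    # Work with the difference only, and never exponentiate a large positive number.
    diff = log_ham - log_spam
    if diff > 0:
        e = math.exp(-diff)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(diff))

# Small hand-checkable case: spam = 0.9 * 0.8, ham = 0.1 * 0.4
small = spam_probability_stable([("win", 0.9, 0.1), ("hello", 0.2, 0.6)], "win")

# Large made-up vocabulary where exp(log_spam) and exp(log_ham) would both underflow.
big_vocab = [("w%d" % i, 0.01, 0.02) for i in range(100000)]
large = spam_probability_stable(big_vocab, "unrelated text")
print(small, large)
```

On small vocabularies this gives the same value as the direct ratio; the difference only matters when both exponentials underflow.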

In [7]:
def naiveBayes_classify(word_probs, message, total_spams, total_non_spams, k):
    """
    classifies the message as spam or ham
    INPUT:
        word_probs: a list of triples (w, p(w|spam), p(w|non-spam))
        message: the message under classification
    OUTPUT:
        'spam' or 'ham' indicating the classification of the message.
    """
    spam_prob = spam_probability(word_probs, message, total_spams, total_non_spams, k)
    
    if spam_prob > 0.5:
        return 'spam'
    else:
        return 'ham'

### Test and Evaluate

In [8]:
import pandas as pd
import numpy as np
spam = pd.read_csv("spam.csv", encoding = 'ISO-8859-1')

In [9]:
spam.head()

Unnamed: 0,label,text
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [10]:
spam.shape

(5572, 2)

In [11]:
spam['is_spam'] = spam['label'].map({'spam':1, 'ham':0})

In [12]:
spam.head()

Unnamed: 0,label,text,is_spam
0,ham,"Go until jurong point, crazy.. Available only ...",0
1,ham,Ok lar... Joking wif u oni...,0
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,1
3,ham,U dun say so early hor... U c already then say...,0
4,ham,"Nah I don't think he goes to usf, he lives aro...",0


In [13]:
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(spam['text'], spam['is_spam'], test_size = 0.2, random_state = 0)

In [14]:
y_test = list(y_test.map({0:'ham',1:'spam'}))

In [15]:
training_set = zip(X_train,y_train)
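
A note on `zip`: in Python 3 it returns a single-use iterator, so `training_set` can be traversed only once. That is enough for the single `count_words` call in this notebook. The sketch below, on made-up lists, shows what happens on a second pass and how wrapping in `list` makes the pairs reusable.

```python
messages = ["win cash now", "see you soon"]  # hypothetical stand-in for X_train
labels = [1, 0]                              # hypothetical stand-in for y_train

pairs = zip(messages, labels)
first_pass = list(pairs)   # consumes the iterator
second_pass = list(pairs)  # nothing left to iterate over

reusable = list(zip(messages, labels))  # materialized once, can be re-read
print(first_pass, second_pass, len(reusable))
```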

In [16]:
counts = count_words(training_set)

In [17]:
counts

defaultdict(<function __main__.count_words.<locals>.<lambda>()>,
            {'durban': [0, 2],
             'ground': [0, 3],
             'no': [52, 216],
             'amla': [0, 1],
             'kallis': [0, 5],
             'town': [2, 24],
             'this': [73, 193],
             'is': [117, 481],
             'home': [2, 128],
             'am': [8, 161],
             'kavalan': [0, 2],
             'now': [151, 227],
             'to': [372, 970],
             'theatre': [0, 4],
             'going': [3, 133],
             'i': [28, 1296],
             'in': [60, 612],
             'a': [228, 687],
             'few': [0, 36],
             'minutes': [5, 21],
             'escape': [0, 4],
             'watch': [0, 27],
             'we': [36, 215],
             'on': [88, 291],
             'address': [4, 13],
             'gt': [0, 189],
             'hill': [0, 3],
             'moms': [0, 4],
             'right': [1, 63],
             'lt': [0, 189],
             'vic

In [18]:
total_spams = y_train.sum()
total_spams

581

In [19]:
total_non_spams = y_train.shape[0] - total_spams
total_non_spams

3876

In [20]:
word_probs = word_probabilities(counts, total_spams, total_non_spams, k=0.5)

In [21]:
# Check that the classifier works on one message from the dataset.
naiveBayes_classify(word_probs, spam['text'][2], total_spams, total_non_spams, 0.5)

'spam'

In [22]:
X_train.iloc[0]

'No no:)this is kallis home ground.amla home town is durban:)'

In [23]:
X_test.iloc[0]

'Aight should I just plan to come up later tonight?'

In [24]:
y_pred = [naiveBayes_classify(word_probs, message, total_spams, total_non_spams, 0.5)
          for message in X_test]

In [25]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

         ham       0.98      1.00      0.99       949
        spam       0.99      0.87      0.92       166

    accuracy                           0.98      1115
   macro avg       0.98      0.93      0.96      1115
weighted avg       0.98      0.98      0.98      1115



In [26]:
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score

print("Accuracy score: ", accuracy_score(y_test, y_pred))
print("Recall score: ", recall_score(y_test, y_pred, average = 'weighted'))
print("Precision score: ", precision_score(y_test, y_pred, average = 'weighted'))
print("F1 score: ", f1_score(y_test, y_pred, average = 'weighted'))

Accuracy score:  0.97847533632287
Recall score:  0.97847533632287
Precision score:  0.9786368643629143
F1 score:  0.9778976677801595
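
The scores above use `average = 'weighted'`, which averages the per-class values weighted by class support; for recall this works out to the same number as accuracy, which is why the two lines match. To look at the spam class on its own, precision, recall, and F1 can be computed directly from the counts of true positives, false positives, and false negatives. The following is a minimal sketch on made-up labels; `binary_metrics` is a hypothetical helper, not a scikit-learn function.

```python
def binary_metrics(y_true, y_pred, positive="spam"):
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical toy labels: 2 true positives, 1 false positive, 1 false negative.
toy_true = ["spam", "spam", "spam", "ham", "ham", "ham", "ham", "ham"]
toy_pred = ["spam", "spam", "ham", "ham", "ham", "ham", "ham", "spam"]
precision, recall, f1 = binary_metrics(toy_true, toy_pred)
print(precision, recall, f1)
```

Called as `binary_metrics(y_test, y_pred)`, this should correspond to the `spam` row of the classification report.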
