# 6. Spam classification

In this problem, we will use the naive Bayes algorithm and an SVM to build a spam classifier.
In recent years, spam on electronic media has been a growing concern. Here, we'll build a classifier to distinguish between real messages, and spam messages. For this class, we will be building a classifier to detect SMS spam messages. We will be using an SMS spam dataset developed by Tiago A. Almedia and Jose Marıa Gomez Hidalgo which is publicly available on http://www.dt.fee.unicamp.br/~tiago/smsspamcollection.

We have split this dataset into training and testing sets and have included them in this assignment as `data/ds6_spam_train.tsv` and `data/ds6_spam_test.tsv`. See `data/ds6_readme.txt` for more details about this dataset. Please refrain from redistributing these dataset files. The goal of this assignment is to build a classifier from scratch that can tell the difference the spam and non-spam messages using the text of the SMS message.

__(a)__ [5 points] Implement code for processing the the spam messages into numpy arrays that can
be fed into machine learning models. Do this by completing the `get_words`, `create_dictionary`, and `transform_text` functions within our provided `src/p06_spam.py`. Do note the corresponding comments for each function for instructions on what specific processing is required. The provided code will then run your functions and save the resulting dictionary into `output/p06_dictionary` and a sample of the resulting training matrix into
`output/p06_sample_train_matrix`.

### Answer:

In [1]:
import collections

import numpy as np

import src.util as util
import src.svm as svm
%load_ext autoreload
%autoreload 2

In [2]:
def get_words(message):
    """Get the normalized list of words from a message string.

    This function should split a message into words, normalize them, and return
    the resulting list. For splitting, you should split on spaces. For normalization,
    you should convert everything to lowercase.

    Args:
        message: A string containing an SMS message

    Returns:
       The list of normalized words from the message.
    """

    # *** START CODE HERE ***
    return message.lower().split()
    # *** END CODE HERE ***


def create_dictionary(messages):
    """Create a dictionary mapping words to integer indices.

    This function should create a dictionary of word to indices using the provided
    training messages. Use get_words to process each message. 

    Rare words are often not useful for modeling. Please only add words to the dictionary
    if they occur in at least five messages.

    Args:
        messages: A list of strings containing SMS messages

    Returns:
        A python dict mapping words to integers.
    """

    # *** START CODE HERE ***
    
    all_words = [word for message in messages for word in get_words(message)]
    word_counts = collections.Counter(all_words)
    words = [word for word in word_counts if word_counts[word]>=5]
    return {words[ind]: ind for ind in range(len(words)) }

    # *** END CODE HERE ***


def transform_text(messages, word_dictionary):
    """Transform a list of text messages into a numpy array for further processing.

    This function should create a numpy array that contains the number of times each word
    appears in each message. Each row in the resulting array should correspond to each 
    message and each column should correspond to a word.

    Use the provided word dictionary to map words to column indices. Ignore words that 
    are not present in the dictionary. Use get_words to get the words for a message.

    Args:
        messages: A list of strings where each string is an SMS message.
        word_dictionary: A python dict mapping words to integers.

    Returns:
        A numpy array marking the words present in each message.
    """
    # *** START CODE HERE ***
    
    m, n = len(messages), len(word_dictionary)
    array = np.zeros((m,n), dtype=int)
    
    #list_words = list(word_dictionary.keys())
    
    for i in range(m):
        words_count = collections.Counter(get_words(messages[i]))
        for word in  words_count:
            if word in word_dictionary:
                array[i,word_dictionary[word]] = words_count[word]   
    return array
    # *** END CODE HERE ***

__(b)__ [10 points] In this question you are going to implement a naive Bayes classifier for spam classification with multinomial event model and Laplace smoothing (refer to class notes on Naive Bayes for details on Laplace smoothing).

Write your implementation by completing the fit naive bayes model and
predict from naive bayes model functions in `src/p06_spam.py`.
`src/p06_spam.py` should then be able to train a Naive Bayes model, compute your prediction accuracy and then save your resulting predictions to `output/p06_naive_bayes_predictions`. Remark. If you implement naive Bayes the straightforward way, you'll find that the computed $p(x|y)= \prod_i p(x_i|y)$ often equals zero. This is because $p(x|y)$, which is the product of many numbers less than one, is a very small number. The standard computer representation of real numbers cannot handle numbers that are too small, and instead rounds them off to zero. (This is called “underflow.”) You’ll have to find a way to compute Naive Bayes’ predicted class labels without explicitly representing very small numbers such
as $p(x|y)$. 

__Hint:__ Think about using logarithms.

### Answer:

In [3]:
def fit_naive_bayes_model(matrix, labels):
    """Fit a naive bayes model.

    This function should fit a Naive Bayes model given a training matrix and labels.

    The function should return the state of that model.

    Feel free to use whatever datatype you wish for the state of the model.

    Args:
        matrix: A numpy array containing word counts for the training data
        labels: The binary (0 or 1) labels for that training data

    Returns: The trained model
    """

    # *** START CODE HERE ***
    
    m, n = matrix.shape
    phi_y = np.mean(labels)
    phi_k_y1 = (matrix[labels==1].sum(axis = 0) +1)/(matrix[labels==1].sum()+n)
    phi_k_y0 = (matrix[labels==0].sum(axis = 0) +1)/(matrix[labels==0].sum()+n)
    return phi_y, phi_k_y0, phi_k_y1

    # *** END CODE HERE ***

In [4]:
def predict_from_naive_bayes_model(model, matrix):
    """Use a Naive Bayes model to compute predictions for a target matrix.

    This function should be able to predict on the models that fit_naive_bayes_model
    outputs.

    Args:
        model: A trained model from fit_naive_bayes_model
        matrix: A numpy array containing word counts

    Returns: A numpy array containg the predictions from the model
    """
    # *** START CODE HERE ***
    phi_y, phi_k_y0, phi_k_y1 = model
    log_p_y_1 = matrix @ np.log(phi_k_y1) + np.log(phi_y)
    log_p_y_0 = matrix @ np.log(phi_k_y0) + np.log(1-phi_y)
    return (log_p_y_1 >= log_p_y_0)
    # *** END CODE HERE ***

__(c)__ [5 points] Intuitively, some tokens may be particularly indicative of an SMS being in a particular class. We can try to get an informal sense of how indicative token $i$ is for the SPAM class by looking at:

\begin{align*}
\log\frac{P(x_j=i|y=1)}{P(x_j=i|y=0)} = \log\frac{P(\mbox{token $i|$ email is SPAM)}}{P (\mbox{token $i|$email is NOTSPAM)}}.
\end{align*}

Complete the `get_top_five_naive_bayes_words` function within the provided code using the above formula in order to obtain the $5$ most indicative tokens.

The provided code will print out the resulting indicative tokens and then save thm to
`output/p06_top_indicative_words`.

### Answer:

In [5]:
def get_top_five_naive_bayes_words(model, dictionary):
    """Compute the top five words that are most indicative of the spam (i.e positive) class.

    Ues the metric given in 6c as a measure of how indicative a word is.
    Return the words in sorted form, with the most indicative word first.

    Args:
        model: The Naive Bayes model returned from fit_naive_bayes_model
        dictionary: A mapping of word to integer ids

    Returns: The top five most indicative words in sorted order with the most indicative first
    """
    # *** START CODE HERE ***
    
    phi_y, phi_k_y0, phi_k_y1 = model
    top_5_index = np.argsort(-np.log(phi_k_y1) + np.log(phi_k_y0))[:5]
    reverced_dictionary = {dictionary[k]:k for k in dictionary}
    return [reverced_dictionary[i] for i in top_5_index]
    # *** END CODE HERE ***

__(d)__ [2 points] Support vector machines (SVMs) are an alternative machine learning model that we discussed in class. We have provided you an SVM implementation (using a radial basis function (RBF) kernel) within `src/svm.py` (You should not need to modify that code). One important part of training an SVM parameterized by an RBF kernel is choosing an appropriate kernel radius.

Complete the `compute_best_svm_radius` by writing code to compute the best SVM radius which maximizes accuracy on the validation dataset.

The provided code will use your `compute_best_svm_radius` to compute and then write the best radius into `output/p06_optimal_radius`.

### <font color=red> Answer:</font>

In [6]:
def compute_best_svm_radius(train_matrix, train_labels, val_matrix, val_labels, radius_to_consider):
    """Compute the optimal SVM radius using the provided training and evaluation datasets.

    You should only consider radius values within the radius_to_consider list.
    You should use accuracy as a metric for comparing the different radius values.

    Args:
        train_matrix: The word counts for the training data
        train_labels: The spma or not spam labels for the training data
        val_matrix: The word counts for the validation data
        val_labels: The spam or not spam labels for the validation data
        radius_to_consider: The radius values to consider
    
    Returns:
        The best radius which maximizes SVM accuracy.
    """
    # *** START CODE HERE ***
    best_accuracy = 0
    best_radius = None
    for radius in radius_to_consider:
        svm_predict = svm.train_and_predict_svm(train_matrix, train_labels, val_matrix, radius)
        current_accuracy = np.mean(svm_predict == val_labels)
        if current_accuracy > best_accuracy:
            best_accuracy = current_accuracy
            best_radius = radius
    return best_radius
    # *** END CODE HERE ***

In [7]:
def main():
    train_messages, train_labels = util.load_spam_dataset('data/ds6_train.tsv')
    val_messages, val_labels = util.load_spam_dataset('data/ds6_val.tsv')
    test_messages, test_labels = util.load_spam_dataset('data/ds6_test.tsv')
    
    dictionary = create_dictionary(train_messages)

    util.write_json('output/p06_dictionary', dictionary)

    train_matrix = transform_text(train_messages, dictionary)

    np.savetxt('output/p06_sample_train_matrix', train_matrix[:100,:])

    val_matrix = transform_text(val_messages, dictionary)
    test_matrix = transform_text(test_messages, dictionary)

    naive_bayes_model = fit_naive_bayes_model(train_matrix, train_labels)

    naive_bayes_predictions = predict_from_naive_bayes_model(naive_bayes_model, test_matrix)

    np.savetxt('output/p06_naive_bayes_predictions.txt', naive_bayes_predictions)

    naive_bayes_accuracy = np.mean(naive_bayes_predictions == test_labels)

    print('Naive Bayes had an accuracy of {} on the testing set'.format(naive_bayes_accuracy))

    top_5_words = get_top_five_naive_bayes_words(naive_bayes_model, dictionary)

    print('The top 5 indicative words for Naive Bayes are: ', top_5_words)

    util.write_json('output/p06_top_indicative_words', top_5_words)

    optimal_radius = compute_best_svm_radius(train_matrix, train_labels, val_matrix, val_labels, [0.01, 0.1, 1, 10])

    util.write_json('output/p06_optimal_radius', optimal_radius)

    print('The optimal SVM radius was {}'.format(optimal_radius))

    svm_predictions = svm.train_and_predict_svm(train_matrix, train_labels, test_matrix, optimal_radius)

    svm_accuracy = np.mean(svm_predictions == test_labels)

    print('The SVM model had an accuracy of {} on the testing set'.format(svm_accuracy, optimal_radius))

In [None]:
main()

Naive Bayes had an accuracy of 0.978494623655914 on the testing set
The top 5 indicative words for Naive Bayes are:  ['claim', 'won', 'prize', 'tone', 'urgent!']
The optimal SVM radius was 0.1
