# Lecture 17 – More Naive Bayes

## DSC 40A, Fall 2021

In [None]:
import pandas as pd
import numpy as np

In this notebook we'll look at how to use Naive Bayes to classify emails as spam or ham (not spam).

<br>

<center>
<img src='https://images2.minutemediacdn.com/image/upload/c_crop,h_1576,w_2800,x_0,y_52/v1554931909/shape/mentalfloss/20997-istock-471531747.jpg?itok=3s4MLcXA' width=400></center>

### The data

Let's load in a dataset of real spam and ham emails.

In [None]:
data = pd.read_csv('data/spam_ham_dataset.csv').get(['text', 'label'])

In [None]:
data

Here's what an email in our DataFrame looks like:

In [None]:
print(data.get('text').iloc[0])

Let's convert all emails to lower-case.

In [None]:
data['text'] = data['text'].str.lower()

In [None]:
print(data.get('text').iloc[0])

### Creating a design matrix

Let's load in a dictionary of words that we can use to differentiate spam and ham emails.

In [None]:
# Each word appears once; repeated entries would collapse into a single feature anyway.
words = ['body', 'click', 'please', 'base64', '2002', 'html', 'subscribed',
         'wrote', 'mortgage', 'align3dcenterfont', 'dear', 'br', 'width10img',
         'divfont', 'im', 'receive', 'list', 'tags', 'web', 'money', 'offer',
         'contact', 'free', 'tr', 'removed', 'remove', 'font', 'form',
         'credit', 'business', 'div']

Before we run Naive Bayes, we need to use the bag-of-words encoding to come up with features.

For a single word, such as `'please'`, we can come up with a column of our design matrix as follows:

In [None]:
data['text'].str.contains('please').astype(int)

To do this for every single word, we have:

In [None]:
def featurize(email):
    '''Returns a Series containing the feature vector for a single email.'''
    return pd.Series({word: int(word in email) for word in words})

In [None]:
design_matrix = data['text'].apply(featurize)

`design_matrix` now contains one row per email and one column per word in our dictionary:

In [None]:
design_matrix
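To make the encoding concrete, here is a minimal sketch of the same idea in plain Python. The names `toy_words`, `toy_emails`, and `toy_featurize` are hypothetical and are not part of the course dataset. Note that `word in email` is a substring test, so a short word like `'im'` also matches inside `'time'`.

```python
# Toy dictionary and emails (hypothetical, not from the course dataset).
toy_words = ['free', 'click', 'im']
toy_emails = [
    'click here for free money',
    'see you at lunch time',
]

def toy_featurize(email):
    '''Returns a dict mapping each dictionary word to 1 if it appears in the email, else 0.'''
    # `word in email` checks for a substring, so 'im' is found inside 'time'.
    return {word: int(word in email) for word in toy_words}

# One row per email, one entry per dictionary word.
toy_matrix = [toy_featurize(email) for email in toy_emails]
for row in toy_matrix:
    print(row)
```

The second email never uses the word "im" on its own, yet its `'im'` feature is 1. This is worth keeping in mind when interpreting the columns of `design_matrix`.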

### Setting up the model

We're now ready to run the Naive Bayes algorithm. We could do this ourselves by hand, but we'll instead use the `CategoricalNB` object from `sklearn` that you'll also use in Homework 8.

In [None]:
from sklearn.naive_bayes import CategoricalNB

Let's create a `CategoricalNB` object. `alpha=1` enables smoothing as we've defined it in class.

In [None]:
model = CategoricalNB(alpha=1)
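As a rough illustration of what `alpha` does, the sketch below computes a smoothed conditional probability by hand. It assumes the usual additive smoothing rule, (count + alpha) / (class total + alpha × number of values the feature can take), which is the form described in the `sklearn` documentation for `CategoricalNB`. The function name `smoothed_estimate` and the counts are hypothetical.

```python
def smoothed_estimate(count, class_total, n_values=2, alpha=1):
    '''Estimates P(feature = value | class) with additive smoothing.

    count: number of emails in the class where the feature takes this value
    class_total: number of emails in the class
    n_values: number of values the feature can take (2 for our 0/1 features)
    '''
    return (count + alpha) / (class_total + alpha * n_values)

# Hypothetical counts: a word appears in 0 of 10 ham emails.
unsmoothed = smoothed_estimate(0, 10, alpha=0)  # zero probability
smoothed = smoothed_estimate(0, 10, alpha=1)    # small but nonzero

print(unsmoothed, smoothed)
```

Without smoothing, a single unseen word would force the whole product of conditional probabilities to zero. With `alpha=1`, the estimate stays small but positive.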

We now need to "fit" the model to the data.

In [None]:
model.fit(X=design_matrix, y=data['label'])
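For intuition about what fitting involves, here is a minimal from-scratch sketch of Naive Bayes on binary features. It is not how `sklearn` is implemented internally; the toy data and the helper names `fit_naive_bayes` and `predict_naive_bayes` are assumptions made for illustration.

```python
import math
from collections import Counter

# Hypothetical toy data: each row is [contains 'free', contains 'homework'].
toy_X = [[1, 0], [1, 0], [1, 1], [0, 1], [0, 1], [0, 0]]
toy_y = ['spam', 'spam', 'spam', 'ham', 'ham', 'ham']

def fit_naive_bayes(X, y, alpha=1):
    '''Returns class priors and smoothed estimates of P(feature_j = 1 | class).'''
    class_counts = Counter(y)
    priors = {c: n / len(y) for c, n in class_counts.items()}
    cond = {}
    for c, n_c in class_counts.items():
        rows = [x for x, label in zip(X, y) if label == c]
        # Each feature takes 2 values (0 or 1), hence alpha * 2 in the denominator.
        cond[c] = [(sum(r[j] for r in rows) + alpha) / (n_c + alpha * 2)
                   for j in range(len(X[0]))]
    return priors, cond

def predict_naive_bayes(x, priors, cond):
    '''Returns the class with the largest log prior plus summed log likelihoods.'''
    scores = {}
    for c in priors:
        score = math.log(priors[c])
        for j, value in enumerate(x):
            p_one = cond[c][j]
            # Use P(feature = 1 | class) or its complement, depending on the value.
            score += math.log(p_one if value == 1 else 1 - p_one)
        scores[c] = score
    return max(scores, key=scores.get)

toy_priors, toy_cond = fit_naive_bayes(toy_X, toy_y)
print(toy_cond)
print(predict_naive_bayes([1, 0], toy_priors, toy_cond))
```

Working in log space avoids multiplying many small probabilities together, which can underflow when there are many features.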

### Making predictions

Now that we've "fit" the model, we can use it to make predictions.

In [None]:
def get_prediction(email, prob=False):
    '''Calls model.predict to determine the predicted class (spam or ham) for a single email.
       If the optional argument prob=True is used, the predicted probability of the more likely class is also printed.
    '''
    # Featurize once and reshape into a single-row 2D array, which is what sklearn expects.
    features = featurize(email).values.reshape(1, -1)
    if prob:
        probs = model.predict_proba(features)
        # Format as a percentage with two decimals to avoid floating-point noise in the output.
        print(f'Probability: {probs[0].max():.2%}')
    return model.predict(features)[0]

In [None]:
get_prediction('''
my name is king triton
please click on this email to receive free credit cards for your new business
''')

In [None]:
get_prediction('''
hey! i had a question on homework 8, part 1d in dsc 40a.
''', prob=True)

### Accuracy

A metric that we often use when classifying is **accuracy**, which is defined as the fraction of data points that were classified correctly.
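As a quick sketch of this definition, accuracy can be computed by hand from a list of predictions and a list of true labels. The function name `accuracy` and the toy labels below are hypothetical.

```python
def accuracy(predicted, actual):
    '''Returns the fraction of predictions that match the true labels.'''
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

# Hypothetical predictions and true labels for five emails.
toy_predicted = ['spam', 'ham', 'ham', 'spam', 'ham']
toy_actual = ['spam', 'ham', 'spam', 'spam', 'ham']

toy_accuracy = accuracy(toy_predicted, toy_actual)  # 4 of the 5 labels match
print(toy_accuracy)
```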

As it turns out, `sklearn` has a built-in method that calculates accuracy – `.score`.

In [None]:
model.score(X=design_matrix, y=data['label'])

This is telling us that Naive Bayes has an accuracy of 78.34% on the data in `design_matrix`, which is the same data we used to fit the model. In practice, however, we'll often care more about the accuracy of a classifier on what is called **test data**, which is data that we didn't use to train the model. (Recall from earlier in the quarter that the purpose of a prediction rule is to make predictions about data for which we don't already know the "right answer".)
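One simple way to obtain test data is to set aside a random subset of the rows before fitting. The sketch below shows the idea using only row positions; the helper `train_test_split_indices` and its parameters are hypothetical. In practice, `sklearn.model_selection.train_test_split` does this for you.

```python
import random

def train_test_split_indices(n, test_fraction=0.25, seed=0):
    '''Randomly splits the row positions 0, ..., n - 1 into training and test positions.'''
    indices = list(range(n))
    # A seeded generator makes the shuffle reproducible.
    random.Random(seed).shuffle(indices)
    n_test = int(n * test_fraction)
    return indices[n_test:], indices[:n_test]

# Hypothetical dataset with 20 rows: 15 go to training, 5 to testing.
train_idx, test_idx = train_test_split_indices(20)
print(len(train_idx), len(test_idx))
```

With positions like these, one could fit on `design_matrix.iloc[train_idx]` and call `.score` on `design_matrix.iloc[test_idx]` (with the matching labels) to estimate accuracy on unseen emails.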

### More models

As you know, DSC 40A is called **Theoretical Foundations of Data Science (Part 1)**. That's why we spent most of our time working on paper and talking about the math behind various techniques.

To wrap up the quarter, we'll show you how we can classify emails as spam or ham using a variety of other models in `sklearn`, without walking through the math. You'll see the math in future courses.

**Logistic Regression**

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
lr_model = LogisticRegression()
lr_model.fit(X=design_matrix, y=data['label'])

In [None]:
lr_model.score(X=design_matrix, y=data['label'])

Let's store our accuracies in a dictionary that we can refer back to later.

In [None]:
model_accs = {}
model_accs['naive bayes'] = model.score(X=design_matrix, y=data['label'])
model_accs['logistic regression'] = lr_model.score(X=design_matrix, y=data['label'])

**Decision Trees**

In [None]:
from sklearn.tree import DecisionTreeClassifier

In [None]:
dt_model = DecisionTreeClassifier()
dt_model.fit(X=design_matrix, y=data['label'])

In [None]:
model_accs['decision tree'] = dt_model.score(X=design_matrix, y=data['label'])
model_accs['decision tree']

**Support Vector Machines**

In [None]:
from sklearn.svm import SVC

In [None]:
svm_model = SVC()
svm_model.fit(X=design_matrix, y=data['label'])

In [None]:
model_accs['support vector machine'] = svm_model.score(X=design_matrix, y=data['label'])
model_accs['support vector machine']

**Overall**

In [None]:
model_accs

Note that the code for all three of these new models (Logistic Regression, Decision Trees, and Support Vector Machines) is almost identical. Under the hood, though, there's a lot of math going on. You now have a taste of the math that's involved in making predictions! 😇