# Naive Bayes Classifiers

#### Jessica Morrise

In [126]:
import numpy as np
from sklearn.model_selection import train_test_split  # cross_validation was removed from newer sklearn releases
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import MultinomialNB

If we are willing to assume that the features in our dataset are conditionally independent given the class label, we get a very fast classifier that works well on certain kinds of datasets. Below we code up two examples: Gaussian Naive Bayes and Multinomial Naive Bayes. In both examples, our classifier and sklearn's classifier achieve identical accuracy on the test set.

## Part 1: Using the Seeds Dataset

In this part of the lab, we create a simple Gaussian Naive Bayes classifier to train on the seeds dataset. This dataset contains measurements of seeds taken from three different species of wheat. The "features" are 7 real-valued measurements:
* Area
* Perimeter
* Compactness
* Length
* Width
* Asymmetry Coefficient
* Groove Length

The real-valued nature of the features makes this an ideal dataset for training a Gaussian Naive Bayes classifier.

Below we load in the data and split it into a training set and a 40-sample test set.

In [2]:
data=[]
with open('seeds_dataset.txt','r') as f:
    for line in f:
        # list(...) keeps each row a concrete list under Python 3 as well
        features = list(map(float, line.strip().split()))
        data.append(features)
        
data = np.array(data)
N = data.shape[0]
train, test = train_test_split(data,test_size=40)
train_features = train[:,:7]
train_labels = train[:,7]
test_features = test[:,:7]
test_labels = test[:,7]
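
As an aside, the manual parsing loop above can probably be replaced by `np.loadtxt`, assuming the file holds nothing but whitespace-separated numbers. The sketch below uses an in-memory stand-in with made-up values rather than the real `seeds_dataset.txt`, so the numbers are purely illustrative:

```python
import io
import numpy as np

# Stand-in for seeds_dataset.txt: two made-up rows of 7 features plus a label.
# The second row mixes tabs and repeated separators on purpose.
fake_file = io.StringIO(
    "1.0 2.0 3.0 4.0 5.0 6.0 7.0 1\n"
    "1.5\t2.5\t\t3.5\t4.5 5.5\t6.5\t7.5\t2\n"
)

# With the default delimiter, any run of whitespace separates two values.
data = np.loadtxt(fake_file)
features, labels = data[:, :7], data[:, 7]
print(data.shape)
```

Passing the real filename instead of `fake_file` should give the same kind of array as the loop, as long as every row has the same number of columns.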

To create a Gaussian Naive Bayes classifier, we calculate the mean and variance of each feature for each of the three species. A uniform prior over the species is assumed. To classify a sample, we calculate the log probability of the sample given each of the three species, and assign it to whichever species yields the highest log likelihood. For this particular test and training set, our accuracy is 92.5%. Not bad.
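
Written out, the score computed for each species is a sum of univariate Gaussian log densities, one per feature. The symbols $\mu_{kj}$ and $\sigma_{kj}^2$ are notation introduced for this note (the training mean and variance of feature $j$ within species $k$), not names taken from the lab:

$$\log p(x \mid c_k) = \sum_{j=1}^{7}\left[-\frac{(x_j-\mu_{kj})^2}{2\sigma_{kj}^2} - \frac{1}{2}\log\left(2\pi\sigma_{kj}^2\right)\right]$$

Because the prior is uniform, adding $\log p(c_k)$ would shift every species' score by the same constant, so it can be dropped without changing which species wins.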

In [3]:
def gaussian_log_prob(features,mus,var):
    temp = -0.5*(features-mus)**2/var - np.log((2.*np.pi*var)**0.5)
    return np.sum(temp,axis=1)

def accuracy(predicted, actual):
    n_correct =  np.sum(predicted==actual)
    print("\n%d correctly labeled (%.1f percent accuracy)"%(n_correct,100*n_correct/float(np.size(actual))))
    
variances = []
means = []
variances.append(np.var(train_features[train_labels==1],axis=0))
variances.append(np.var(train_features[train_labels==2],axis=0))
variances.append(np.var(train_features[train_labels==3],axis=0))
means.append(np.mean(train_features[train_labels==1],axis=0))
means.append(np.mean(train_features[train_labels==2],axis=0))
means.append(np.mean(train_features[train_labels==3],axis=0))

# Do the classification
log_probs = np.zeros((test_labels.size,3))
for i in range(3):
    log_probs[:,i] = gaussian_log_prob(test_features,means[i], variances[i])
predicted_labels = np.argmax(log_probs,axis=1)+1

accuracy(predicted_labels,test_labels)


37 correctly labeled (92.5 percent accuracy)


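Training sklearn's GaussianNB classifier on the same training set gives us exactly the same classification accuracy! We basically just hand-coded something from sklearn. Nice work. (By default sklearn estimates the class priors from the training labels rather than assuming a uniform prior, so the two models are not quite identical, but they reach the same accuracy on this split.)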

In [4]:
#Now use sklearn
gnb = GaussianNB()
gnb.fit(train_features,train_labels)
predicted_gnb_labels = gnb.predict(test_features)
accuracy(predicted_gnb_labels,test_labels)


37 correctly labeled (92.5 percent accuracy)


## Part 2: The Spam Problem

Below we implement a class to encapsulate the methods we used above: fit(X,Y) for fitting on training data $X$ and training labels $Y$, and predict(X) for assigning labels to a test set $X$.

We will use this classifier to mark emails as "spam" or "not spam". Our dataset consists of several thousand emails, labeled with a 1 for spam or a 0 for not spam. Each email is stored in a simplified representation as a word count vector. Thus, the Naive Bayes classifier will not be able to use word order to classify emails, only word counts. The classifier will also make the naive assumption that words in an email are independent of each other. As it turns out, this assumption works rather well for spam vs. not spam.
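
To make the word count representation concrete, here is a minimal sketch. The five-word vocabulary and the `count_vector` helper are hypothetical, made up for illustration; the real dataset's vocabulary is not shown in this lab.

```python
from collections import Counter

# Hypothetical vocabulary, for illustration only.
vocabulary = ["free", "money", "meeting", "report", "win"]

def count_vector(text, vocabulary):
    """Return how many times each vocabulary word occurs in text."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

a = count_vector("win free money free", vocabulary)
b = count_vector("free money win free", vocabulary)
print(a)       # [2, 1, 0, 0, 1]
print(a == b)  # True
```

The two messages contain the same words in a different order and map to the same vector, which is exactly the information the classifier gives up.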

Rather than Gaussian probabilities, the spam classifier uses the following:

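$$p_{ij} = \frac{count(c_i,v_j) + 1}{\sum_{j'=1}^n \left(count(c_i,v_{j'}) + 1\right)}$$

where $i$ is the index over labels, $j$ is the index over vocabulary words, $n$ is the vocabulary size, and $count(c_i,v_j)$ denotes the number of occurrences of word $v_j$ among all training documents that have label $c_i$. The added 1 is add-one (Laplace) smoothing: a word that never appears under a label still gets a nonzero probability.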
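
As a small worked example of the formula, the sketch below applies it to a toy count matrix. The data are made up (4 documents, 3 vocabulary words), and the `smoothed` dictionary is just a convenient name for this note:

```python
import numpy as np

# Toy word-count matrix: 4 documents (rows) by 3 vocabulary words (columns).
X = np.array([[2, 0, 1],
              [1, 0, 0],
              [0, 3, 0],
              [0, 1, 1]])
Y = np.array([1, 1, 0, 0])  # 1 = spam, 0 = not spam

smoothed = {}
for label in (0, 1):
    counts = X[Y == label].sum(axis=0)                      # count(c_i, v_j) for every word j
    smoothed[label] = (counts + 1.0) / np.sum(counts + 1.0)  # add one, then normalize

print(smoothed[0])  # probabilities 1/8, 5/8, 2/8
print(smoothed[1])  # probabilities 4/7, 1/7, 2/7
```

The second word never appears in the spam documents, yet its smoothed probability under the spam label is 1/7 rather than 0, so one unseen word cannot push a log probability to negative infinity.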

In [107]:
class naiveBayes(object):
    def __init__(self):
        pass
    
    def fit(self, X, Y):
        '''
        X: training data
           shape is (n_samples, n_features)
        Y: training labels
           shape is (n_samples)
        ''' 
        self.labels = np.unique(Y) # sorted unique labels, so the class order is deterministic
        self.K = len(self.labels) # number of unique labels
        N = X.shape[0]
        
        self.p = np.zeros((self.K,X.shape[1]))
        self.prior = np.zeros(self.K)
        for k in range(self.K):
            c_k = self.labels[k]
            self.prior[k] = np.sum(Y==c_k)/float(N)
            counts = np.sum(X[Y==c_k,:],axis=0)
            p_j = (counts+1)/float(np.sum(counts+1))
            self.p[k,:] = p_j
 
    def predict(self, X):
        '''
        X: test data
           shape is (n_samples, n_features)
           
        Returns Y: labels of test data
        '''
        N = X.shape[0]
        predicted_labels = np.empty(N)
        for j in range(N):
            log_prob = np.sum(np.log(self.p)*X[j],axis=1) + np.log(self.prior)
            k = np.argmax(log_prob)
            predicted_labels[j] = self.labels[k]
        return predicted_labels   
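
The per-sample loop in `predict` can also be written as a single matrix product, since each class score is a dot product between a count vector and a row of log probabilities. The following is a sketch of that idea, not a drop-in replacement: `predict_vectorized` is a hypothetical standalone function whose arguments play the roles of `self.p`, `self.prior`, and `self.labels`, and the toy values are made up.

```python
import numpy as np

def predict_vectorized(X, p, prior, labels):
    """Score every sample against every class with one matrix product.

    X:      (n_samples, n_features) word counts
    p:      (n_classes, n_features) smoothed word probabilities
    prior:  (n_classes,) class priors
    labels: (n_classes,) label values, in the same row order as p
    """
    log_prob = X.dot(np.log(p).T) + np.log(prior)  # shape (n_samples, n_classes)
    return np.asarray(labels)[np.argmax(log_prob, axis=1)]

# Toy values, made up for illustration.
p = np.array([[0.125, 0.625, 0.25],
              [4 / 7.0, 1 / 7.0, 2 / 7.0]])
prior = np.array([0.5, 0.5])
labels = np.array([0, 1])
X_test = np.array([[3, 0, 1],
                   [0, 2, 0]])

predictions = predict_vectorized(X_test, p, prior, labels)
print(predictions)  # [1 0]
```

On a test set of a few hundred emails the loop is fast enough, so this is mostly a matter of taste; the matrix form tends to pay off on larger test sets.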

First, load in the spam features and labels, then split them into a training set and a 500-sample test set.

In [19]:
# load in spam features
spam_features = []
with open('SpamFeatures.txt','r') as f:
    for line in f:
        # list(...) keeps each row a concrete list under Python 3 as well
        counts = list(map(float, line.strip().split()))
        spam_features.append(counts)
feature_matrix = np.array(spam_features)

# load in spam labels
spam_labels = []
with open('SpamLabels.txt','r') as f:
    for line in f:
        label = int(float(line.strip()))
        spam_labels.append(label)
label_matrix = np.array(spam_labels)

In [131]:
# create a training/test set
# append the labels as a final column so features and labels are shuffled together
train_spam, test_spam = train_test_split(np.column_stack((feature_matrix, label_matrix)), test_size=500)

The classifier is surprisingly accurate!

In [132]:
my_nb = naiveBayes()
my_nb.fit(train_spam[:,:-1], train_spam[:,-1])
predicted_spam_labels = my_nb.predict(test_spam[:,:-1])
accuracy(predicted_spam_labels, test_spam[:,-1])


482 correctly labeled (96.4 percent accuracy)


Yet again, sklearn's implementation yields precisely the same accuracy. Why do we even have sklearn?

In [133]:
multi_nb = MultinomialNB()
multi_nb.fit(train_spam[:,:-1], train_spam[:,-1])
predicted_mnb_spam_labels = multi_nb.predict(test_spam[:,:-1])
accuracy(predicted_mnb_spam_labels,test_spam[:,-1])


482 correctly labeled (96.4 percent accuracy)
