# Naive Bayes

In this project, you will implement a Naïve Bayes classifier that predicts whether a name is male or female.

In [1]:
import numpy as np
import sys

# add p03 folder
sys.path.insert(0, './p03/')

%matplotlib inline

In [2]:
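def hashfeatures(baby, B, FIX):
    """Hash prefixes and suffixes of the name `baby` into a binary vector of length B."""
    # note: hash() of a str is salted per process unless PYTHONHASHSEED is set
    v = np.zeros(B)
    for m in range(FIX):
        # prefix of length m (for m = 0 this is the empty string)
        featurestring = "prefix" + baby[:m]
        v[hash(featurestring) % B] = 1
        # suffix of length m (for m = 0, baby[-0:] is the whole name)
        featurestring = "suffix" + baby[-1*m:]
        v[hash(featurestring) % B] = 1
    return v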
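
One caveat: Python's built-in hash() for strings is salted per interpreter process unless the PYTHONHASHSEED environment variable is set, so the same name can land in different buckets from one session to the next. The sketch below is one possible workaround and is not part of the project code. It swaps in an MD5 digest from hashlib to pick the bucket. The names stable_bucket and hashfeatures_stable are hypothetical, and features built this way will differ from those produced by hashfeatures.

```python
import hashlib

import numpy as np


def stable_bucket(featurestring, B):
    """Map a string to a bucket in [0, B) using MD5 (hypothetical helper)."""
    digest = hashlib.md5(featurestring.encode("utf-8")).hexdigest()
    return int(digest, 16) % B


def hashfeatures_stable(baby, B, FIX):
    """Same loop as hashfeatures, but with a hash that is stable across runs."""
    v = np.zeros(B)
    for m in range(FIX):
        v[stable_bucket("prefix" + baby[:m], B)] = 1
        v[stable_bucket("suffix" + baby[-1 * m:], B)] = 1
    return v


v_allen = hashfeatures_stable("Allen", 128, 3)
print(int(v_allen.sum()))  # number of distinct buckets that were set
```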

In [3]:
def name2features(filename, B=128, FIX=3, LoadFile=True):
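    """
    Convert names into hashed feature vectors.

    Input:
    filename : path to a file with one name per line, or (if LoadFile=False)
               a string of newline-separated names
    B        : dimension of each feature vector
    FIX      : number of prefix/suffix lengths hashed per name
    LoadFile : if True, read the names from the file at filename

    Output:
    X : n feature vectors of dimension B, (nxB)
    """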
    # read in baby names
    if LoadFile:
        with open(filename, 'r') as f:
            babynames = [x.rstrip() for x in f.readlines() if len(x) > 0]
    else:
        babynames = filename.split('\n')
    n = len(babynames)
    X = np.zeros((n, B))
    for i in range(n):
        X[i,:] = hashfeatures(babynames[i], B, FIX)
    return X

The function name2features reads every name in the given file and converts it into a B-dimensional feature vector (128 dimensions by default).

Can you figure out what the features are? (Understanding how these features are constructed will help you later on in the competition.)
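
To inspect the features directly, the following sketch lists the strings that hashfeatures hashes for a single name. The helper feature_strings is hypothetical (it is not part of the provided code) and simply mirrors the loop in hashfeatures without the hashing step.

```python
def feature_strings(name, fix=3):
    """List the strings hashed for one name (hypothetical helper)."""
    strings = []
    for m in range(fix):
        strings.append("prefix" + name[:m])
        strings.append("suffix" + name[-1 * m:])
    return strings


print(feature_strings("Allen"))
# ['prefix', 'suffixAllen', 'prefixA', 'suffixn', 'prefixAl', 'suffixen']
```

Note the m = 0 iteration: the prefix is empty, and because name[-0:] is the same as name[0:], the "suffix" is the entire name.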

We have provided you with a Python function genTrainFeatures, which calls name2features, transforms the names into features, and loads them into memory.

In [4]:
def genTrainFeatures(dimension=128, fix=3):
    """
    Convert the names in "girls.train" and "boys.train" into feature
    vectors with name2features and load the training data.

    Input:
    dimension: dimensionality d of the feature vectors
    fix: number of prefix/suffix lengths hashed per name

    Output:
    X: n feature vectors of dimensionality d (nxd)
    Y: n labels (-1 = girl, +1 = boy)
    """
    
    # Load in the data
    Xgirls = name2features("girls.train", B=dimension, FIX=fix)
    Xboys = name2features("boys.train", B=dimension, FIX=fix)
    X = np.concatenate([Xgirls, Xboys])
    
    # Generate Labels
    Y = np.concatenate([-np.ones(len(Xgirls)), np.ones(len(Xboys))])
    
    # shuffle data into random order
    ii = np.random.permutation(len(Y))
    
    return X[ii, :], Y[ii]

You can call the following command to load the features and labels of all boys' and girls' names.

In [5]:
X,Y = genTrainFeatures(128)

## The Naïve Bayes Classifier

The Naïve Bayes classifier is a linear classifier based on Bayes' rule. The following questions ask you to complete these functions in a predefined order.
As a general rule, you should avoid tight loops at all costs.

(a) Estimate the class probability P(Y) in naivebayesPY. This should return the probability that a sample in the training set is positive or negative, independent of its features.

In [15]:
def naivebayesPY(x,y):
    """
    Computation of P(Y)
    Input:
        x : n input vectors of d dimensions (nxd); not used here
        y : n labels (-1 or +1) (nx1)

    Output:
        pos: probability p(y=1)
        neg: probability p(y=-1)
    """
    
    # add one positive and one negative example so that neither class
    # gets probability zero ("plus-one smoothing")
    y = np.concatenate([y, [-1,1]])
    pos = np.mean(y == 1)
    return pos, 1-pos


pos,neg = naivebayesPY(X,Y)
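
As a quick sanity check on the smoothing, here is a minimal sketch on toy labels. The function smoothed_prior is a hypothetical stand-in that repeats the same two steps as naivebayesPY, so the block runs on its own.

```python
import numpy as np


def smoothed_prior(y):
    """Class prior with one extra example per class (hypothetical helper)."""
    y = np.concatenate([y, [-1, 1]])
    pos = np.mean(y == 1)
    return pos, 1 - pos


# two positives and one negative; smoothing adds one more of each
pos_toy, neg_toy = smoothed_prior(np.array([1, 1, -1]))
print(pos_toy, neg_toy)  # 3 of 5 labels are +1, so pos_toy is 0.6
```

Even if the training labels were all of one class, both returned probabilities would stay strictly between 0 and 1.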

(b) Estimate the conditional probabilities P(X|Y) in naivebayesPXY. Use a multinomial distribution as the model. This will return the probability vectors for all features given a class label.

In [42]:
def naivebayesPXY(x,y):
    """
    Computation of P(X|Y) under a multinomial model
    Input:
        x : n input vectors of d dimensions (nxd)
        y : n labels (-1 or +1) (nx1)
    
    Output:
        PXY_pos: probability vector of p(x|y=1) (d entries)
        PXY_neg: probability vector of p(x|y=-1) (d entries)
    """
    
    # plus-one smoothing: add one all-ones example per class so that
    # no feature gets probability zero
    n, d = x.shape
    x = np.concatenate([x, np.ones((2,d))])   # two extra rows of 1s
    y = np.concatenate([y, [-1,1]])           # one negative and one positive label for those rows
    
    # split the rows by class
    pos = x[y==1]
    neg = x[y==-1]
    
    # per-feature counts within each class (sum over the rows)
    pos_counts = np.sum(pos, axis=0)
    neg_counts = np.sum(neg, axis=0)
    
    # normalize by the total feature count of the class, so each vector sums to 1
    PXY_pos = pos_counts/np.sum(pos)
    PXY_neg = neg_counts/np.sum(neg)
    
    return PXY_pos, PXY_neg

PXY_pos, PXY_neg = naivebayesPXY(X,Y)
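
To make the multinomial estimate concrete, the sketch below works through a toy data set with two features. The function multinomial_estimates is a hypothetical, condensed stand-in for naivebayesPXY so that the block is self-contained. The toy data are made up for illustration.

```python
import numpy as np


def multinomial_estimates(x, y):
    """Smoothed per-class feature probabilities (hypothetical helper)."""
    d = x.shape[1]
    x = np.concatenate([x, np.ones((2, d))])  # one all-ones row per class
    y = np.concatenate([y, [-1, 1]])
    pos, neg = x[y == 1], x[y == -1]
    return pos.sum(axis=0) / pos.sum(), neg.sum(axis=0) / neg.sum()


x_toy = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
y_toy = np.array([1, 1, -1])
pxy_pos, pxy_neg = multinomial_estimates(x_toy, y_toy)

# positive rows (with smoothing): [1,0], [1,1], [1,1] -> counts [3, 2], total 5
print(pxy_pos)  # [0.6 0.4]
# negative rows (with smoothing): [0,1], [1,1] -> counts [1, 2], total 3
print(pxy_neg)  # approximately [1/3, 2/3]
```

Each returned vector sums to 1, because the counts are divided by the total count of the class rather than by the number of examples.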

(c) Solve for the log ratio log(P(Y=1|X=xtest)/P(Y=-1|X=xtest)) using Bayes' rule in naivebayes.

In [67]:
def naivebayes(x,y,xtest):
    """
    Computation of the log ratio log(P(Y=1|X=xtest)/P(Y=-1|X=xtest)) using Bayes' rule
    Input:
        x : n input vectors of d dimensions (nxd)
        y : n labels (-1 or +1)
        xtest: input vector of d dimensions (1xd)
    
    Output:
        logratio: log( P(X=xtest|Y=1)*P(Y=1) / (P(X=xtest|Y=-1)*P(Y=-1)) )
    """
    PXY_pos, PXY_neg = naivebayesPXY(x,y)
    PY_pos, PY_neg = naivebayesPY(x,y)
    
    print(PXY_pos.shape)  # sanity check: one probability per feature
    
    # indices of the features that are active in xtest
    xtest_idx = np.where(xtest == 1)
    
    # unnormalized posteriors P(X=xtest|Y=1)*P(Y=1) and P(X=xtest|Y=-1)*P(Y=-1);
    # the shared evidence P(X=xtest) cancels in the ratio
    PXY_pos_PY_pos = np.prod(PXY_pos[xtest_idx])*PY_pos
    PXY_neg_PY_neg = np.prod(PXY_neg[xtest_idx])*PY_neg
    
    # return the log of the posterior ratio
    return np.log(PXY_pos_PY_pos/PXY_neg_PY_neg)

p = naivebayes(X,Y,X[0,:])
p

(128,)


0.9455943956563633
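
A product of many small probabilities can underflow to zero in floating point. A common alternative, sketched below under the assumption that the class priors and per-feature probabilities are already available, is to sum logarithms instead of multiplying. The function log_ratio and the toy numbers are hypothetical and are not part of the assignment code.

```python
import numpy as np


def log_ratio(xtest, pxy_pos, pxy_neg, py_pos, py_neg):
    """Log posterior ratio computed in log space (hypothetical helper)."""
    active = xtest == 1  # boolean mask of the features present in xtest
    log_pos = np.sum(np.log(pxy_pos[active])) + np.log(py_pos)
    log_neg = np.sum(np.log(pxy_neg[active])) + np.log(py_neg)
    return log_pos - log_neg


# toy numbers: two features, only the first one is active
ratio = log_ratio(
    np.array([1, 0]),
    np.array([0.6, 0.4]),
    np.array([1.0 / 3.0, 2.0 / 3.0]),
    0.6,
    0.4,
)
print(ratio)  # log(0.6 * 0.6 / ((1/3) * 0.4)) = log(2.7), about 0.993
```

On well-scaled inputs this agrees with taking the log of the ratio of products, up to floating-point rounding.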

(d) Naïve Bayes can also be written as a linear classifier. Implement this in naivebayesCL.
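
The linear form follows from taking the log of the posterior ratio. The derivation below is a sketch in notation introduced here for illustration: θ_{j,+} and θ_{j,-} stand for the entries of the vectors returned by naivebayesPXY, and π_+ and π_- for the class priors returned by naivebayesPY. Under the multinomial model, P(X=x|Y=c) is proportional to the product of θ_{j,c}^{x_j} over all features j, and the multinomial coefficient is the same for both classes, so it cancels in the ratio.

```latex
\begin{aligned}
\log\frac{P(Y=+1 \mid X=x)}{P(Y=-1 \mid X=x)}
&= \log\frac{P(X=x \mid Y=+1)\,P(Y=+1)}{P(X=x \mid Y=-1)\,P(Y=-1)} \\
&= \sum_{j=1}^{d} x_j \log\frac{\theta_{j,+}}{\theta_{j,-}} + \log\frac{\pi_{+}}{\pi_{-}} \\
&= w^\top x + b,
\qquad w_j = \log\frac{\theta_{j,+}}{\theta_{j,-}},
\quad b = \log\frac{\pi_{+}}{\pi_{-}} .
\end{aligned}
```

A name is then classified as +1 exactly when w^T x + b > 0.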

In [99]:
def naivebayesCL(x,y):
    """
    Implementation of a Naive Bayes classifier
    Input:
        x : n input vectors of d dimensions (nxd)
        y : n labels (-1 or +1)

    Output:
        w : weight vector of d dimensions
        b : bias (scalar)
    """
    
    PXY_pos, PXY_neg = naivebayesPXY(x,y)
    PY_pos, PY_neg = naivebayesPY(x,y)
    
    # w_j = log( P(X_j|Y=1) / P(X_j|Y=-1) ), computed elementwise without a Python loop
    w = np.log(PXY_pos/PXY_neg)
    b = np.log(PY_pos/PY_neg)
    return w, b

w,b = naivebayesCL(X,Y)
#w

(e) Implement classifyLinear, which applies a linear weight vector and bias to a set of input vectors and outputs their predictions. (You can use your answer from the previous project.)

In [125]:
def classifyLinear(x,w,b=0,hyper=0):
    """
    Make predictions with a linear classifier
    Input:
        x : n input vectors of d dimensions (nxd)
        w : weight vector of d dimensions
        b : bias (optional)
        hyper : decision threshold applied to x.dot(w) + b (optional)
    
    Output:
        preds: predictions (+1 if the score exceeds hyper, otherwise -1)
    """
    
    result = x.dot(w) + b
    # vectorized thresholding: +1 above the threshold, -1 otherwise
    return np.where(result > hyper, 1, -1)

print('Training error: %.2f%%' % (100 *(classifyLinear(X, w, b) != Y).mean()))

Training error: 21.58%
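
The thresholding step can be written without a Python loop by using np.where. The toy weights and inputs below are made up for illustration. The sketch only demonstrates the decision rule, including the convention that a score exactly at the threshold is labeled -1.

```python
import numpy as np

x_demo = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [1.0, 1.0]])
w_demo = np.array([2.0, -3.0])
b_demo = 1.0

scores = x_demo @ w_demo + b_demo      # [3.0, -2.0, 1.0, 0.0]
preds = np.where(scores > 0, 1, -1)    # a score of exactly 0 maps to -1
print(preds)  # [ 1 -1  1 -1]
```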


You can now test your code with the following interactive name classification script:

In [109]:
DIMS = 128
print('Loading data ...')
X,Y = genTrainFeatures(DIMS)
print('Training classifier ...')
w,b=naivebayesCL(X,Y)
error = np.mean(classifyLinear(X,w,b) != Y)
print('Training error: %.2f%%' % (100 * error))

while True:
    print('Please enter your name>')
    yourname = input()
    if len(yourname) < 1:
        break
    xtest = name2features(yourname,B=DIMS,LoadFile=False)
    pred = classifyLinear(xtest,w,b)[0]
    if pred > 0:
        print("%s, I am sure you are a nice boy.\n" % yourname)
    else:
        print("%s, I am sure you are a nice girl.\n" % yourname)

Loading data ...
Training classifier ...
Training error: 21.58%
Please enter your name>
 Joe
Joe, I am sure you are a nice boy.

Please enter your name>
 Donald
Donald, I am sure you are a nice boy.

Please enter your name>
 Allen
Allen, I am sure you are a nice boy.

Please enter your name>
