<h2>Project 3: Na&iuml;ve Bayes</h2>

<blockquote>
    <center>
    <img src="nb.png" width="200px" />
    </center>
      <p><cite><center>"All models are wrong, but some are useful."<br>
       -- George E.P. Box
      </center></cite></p>
</blockquote>

<h3>Introduction</h3>
<!--Aðalbrandr-->

<p>A&eth;albrandr is visiting America from Norway and has been having the hardest time distinguishing boys and girls because of the weird American names like Jack and Jane.  This has been causing lots of problems for A&eth;albrandr when he goes on dates. When he heard that Cornell has a Machine Learning class, he asked that we help him identify the gender of a person based on their name to the best of our ability.  In this project, you will implement Na&iuml;ve Bayes to predict if a name is male or female. </p>

<strong>How to submit:</strong> You can submit your code using the red <strong>Submit</strong> button above. This button will send any code below that is surrounded by <strong>#&lt;GRADED&gt;</strong><strong>#&lt;/GRADED&gt;</strong> tags to the autograder, which will then run several tests on your code. By clicking on the <strong>Details</strong> dropdown next to the Submit button, you will be able to view your submission report once the autograder has finished running. This submission report contains a summary of the tests you have passed or failed, as well as a log of any errors generated by your code when we ran it.

Note that this may take a while depending on how long your code takes to run! Once your code is submitted you may navigate away from the page as you desire -- the most recent submission report will always be available from the Details menu.

<p><strong>Evaluation:</strong> Your code will be autograded for technical
correctness and--on some assignments--speed. Please <em>do not</em> change the names of any provided functions or classes within the code, or you will wreak havoc on the autograder. Furthermore, <em>any code not surrounded by <strong>#&lt;GRADED&gt;</strong><strong>#&lt;/GRADED&gt;</strong> tags will not be run by the autograder</em>. However, the correctness of your implementation -- not the autograder's output -- will be the final judge of your score.  If necessary, we will review and grade assignments individually to ensure that you receive due credit for your work.

<p><strong>Academic Integrity:</strong> We will be checking your code against other submissions in the class for logical redundancy. If you copy someone else's code and submit it with minor changes, we will know. These cheat detectors are quite hard to fool, so please don't try. We trust you all to submit your own work only; <em>please</em> don't let us down. If you do, we will pursue the strongest consequences available to us.

<p><strong>Getting Help:</strong> You are not alone! If you find yourself stuck on something, contact the course staff for help. Office hours, section, and <a href="https://piazza.com/class/iyag4nk2rsxsv">Piazza</a> are there for your support; please use them. If you can't make our office hours, let us know and we will schedule more. We want these projects to be rewarding and instructional, not frustrating and demoralizing. But we don't know when or how to help unless you ask.</p>
  

<h3> Of boys and girls </h3>

<p> Take a look at the files <code>girls.train</code> and <code>boys.train</code>, for example with the Unix command <pre>cat girls.train</pre> 
<pre>
...
Addisyn
Danika
Emilee
Aurora
Julianna
Sophia
Kaylyn
Litzy
Hadassah
</pre>
Believe it or not, these are all more or less common girls' names. The problem with the current file is that the names are in plain text, which makes it hard for a machine learning algorithm to do anything useful with them. We therefore need to transform them into a vector format, where each name becomes a vector that represents a point in a high-dimensional input space. </p>

<p>That is exactly what the following Python function <code>name2features</code> does: </p>

In [3]:
import numpy as np
import sys

%matplotlib inline

In [2]:
def hashfeatures(baby, B, FIX):
    v = np.zeros(B)
    for m in range(FIX):
        featurestring = "prefix" + baby[:m]
        v[hash(featurestring) % B] = 1
        featurestring = "suffix" + baby[-1*m:]
        v[hash(featurestring) % B] = 1
    return v


In [3]:
def name2features(filename, B=128, FIX=3, LoadFile=True):
    """
    Output:
        X : n feature vectors of dimension B, (nxB)
    """
    # read in baby names
    if LoadFile:
        with open(filename, 'r') as f:
            babynames = [x.rstrip() for x in f.readlines() if len(x) > 0]
    else:
        babynames = filename.split('\n')
    n = len(babynames)
    X = np.zeros((n, B))
    for i in range(n):
        X[i,:] = hashfeatures(babynames[i], B, FIX)
    return X

<p>It reads every name in the given file and converts it into a feature vector (128-dimensional by default). </p> 

<p>Can you figure out what the features are? (Understanding how these features are constructed will help you later on in the competition.)<br></p>
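<p>One way to explore this question is to list the strings that get hashed for a single name. The sketch below mirrors the loop in <code>hashfeatures</code> but skips the hashing step; the helper name <code>feature_strings</code> is hypothetical and not part of the project code.</p>

```python
def feature_strings(baby, FIX=3):
    """Hypothetical helper: list the strings that hashfeatures hashes for one name."""
    strings = []
    for m in range(FIX):
        strings.append("prefix" + baby[:m])       # first m letters
        strings.append("suffix" + baby[-1 * m:])  # last m letters (the whole name when m == 0)
    return strings

print(feature_strings("Sophia"))
# ['prefix', 'suffixSophia', 'prefixS', 'suffixa', 'prefixSo', 'suffixia']
```

<p>Note that for <code>m = 0</code> the prefix is empty, while the slice <code>baby[-0:]</code> returns the whole name, so the first "suffix" string contains the full name.</p>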

<p>We have provided you with a Python function <code>genTrainFeatures</code>, which transforms the names into features and loads them into memory.</p>

In [4]:
def genTrainFeatures(dimension=128):
    """
    Input: 
        dimension: desired dimension of the features
    Output: 
        X: n feature vectors of dimensionality d (nxd)
        Y: n labels (-1 = girl, +1 = boy) (n)
    """
    
    # Load in the data
    Xgirls = name2features("girls.train", B=dimension)
    Xboys = name2features("boys.train", B=dimension)
    X = np.concatenate([Xgirls, Xboys])
    
    # Generate Labels
    Y = np.concatenate([-np.ones(len(Xgirls)), np.ones(len(Xboys))])
    
    # shuffle data into random order
    ii = np.random.permutation(len(Y))
    
    return X[ii, :], Y[ii]

You can call the following command to load the features and labels of all boys' and girls' names. 

In [None]:
X, Y = genTrainFeatures(128)

<h3> The Na&iuml;ve Bayes Classifier </h3>

<p> The Na&iuml;ve Bayes classifier is a linear classifier based on Bayes' rule. The following questions ask you to complete these functions in a pre-defined order. <br>
<strong>As a general rule, you should avoid tight loops at all costs.</strong></p>
<p>(a) Estimate the class probability $P(y)$ in 
<b><code>naivebayesPY</code></b>. This should return the probability that a sample in the training set is positive or negative, independent of its features.
</p>



In [None]:
def naivebayesPY(X, Y):
    """
    naivebayesPY(X, Y) returns [pos,neg]

    Computation of P(Y)
    Input:
        X : n input vectors of d dimensions (nxd)
        Y : n labels (-1 or +1) (n)

    Output:
        pos: probability p(y=1)
        neg: probability p(y=-1)
    """
    
    # add one positive and negative example to avoid division by zero ("plus-one smoothing")
    Y = np.concatenate([Y, [-1,1]])
    n = len(Y)
    ### BEGIN SOLUTION
    pos = np.mean(Y == 1)
    neg = np.mean(Y == -1)
    return pos, neg
    ### END SOLUTION


pos,neg = naivebayesPY(X,Y)
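
<p>As a quick sanity check of the smoothing step, the following sketch works through a made-up label vector with two positive examples and no negative ones. After one extra example per class is appended, the estimates are $3/4$ and $1/4$ rather than $1$ and $0$. The variable names are illustrative only.</p>

```python
import numpy as np

Y_toy = np.array([1, 1])                        # toy labels: two positive examples
Y_smoothed = np.concatenate([Y_toy, [-1, 1]])   # append one example per class

pos = np.mean(Y_smoothed == 1)    # 3 of 4 labels are +1
neg = np.mean(Y_smoothed == -1)   # 1 of 4 labels is -1
print(pos, neg)  # 0.75 0.25
```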

<p>(b) Estimate the conditional probabilities $P([\mathbf{x}]_{\alpha}|y)$ in 
<b><code>naivebayesPXY</code></b>. Notice that by construction, our features are binary categorical features. Use a <b>categorical</b> distribution as the model and, for each class label, return the vector of probabilities that each feature equals 1.
</p> 

In [None]:

def naivebayesPXY(X,Y):
    """
    naivebayesPXY(X, Y) returns [posprob,negprob]
    
    Input:
        X : n input vectors of d dimensions (nxd)
        Y : n labels (-1 or +1) (n)
    
    Output:
        posprob: probability vector of p(x_alpha = 1|y=1)  (d)
        negprob: probability vector of p(x_alpha = 1|y=-1) (d)
    """
    
    # add an all-ones and an all-zeros example to each class to avoid division by zero ("plus-one smoothing")
    n, d = X.shape
    X = np.concatenate([X, np.ones((2,d)), np.zeros((2,d))])
    Y = np.concatenate([Y, [-1,1,-1,1]])
    n, d = X.shape
    
    ### BEGIN SOLUTION
    posprob = np.mean(X[Y == 1], axis=0)
    negprob = np.mean(X[Y == -1], axis=0)
    return posprob, negprob
    ### END SOLUTION
    

posprob,negprob = naivebayesPXY(X,Y)
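
<p>Appending one all-ones and one all-zeros example to each class amounts to estimating each probability as (number of ones + 1) / (class size + 2). The sketch below checks this on made-up data; <code>smoothed_conditionals</code> is a hypothetical helper written for this illustration, not part of the project code.</p>

```python
import numpy as np

# Toy data (made up for illustration): two names, two binary features.
X_toy = np.array([[1, 0],
                  [1, 1]])
Y_toy = np.array([1, -1])

def smoothed_conditionals(X, Y, label):
    """Hypothetical helper: (count of ones + 1) / (class size + 2) per feature."""
    X_c = X[Y == label]
    return (X_c.sum(axis=0) + 1) / (X_c.shape[0] + 2)

posprob_toy = smoothed_conditionals(X_toy, Y_toy, 1)    # approximately [2/3, 1/3]
negprob_toy = smoothed_conditionals(X_toy, Y_toy, -1)   # approximately [2/3, 2/3]
print(posprob_toy, negprob_toy)
```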

<p>(b) Assume you are using the natural log. Calculate the log likelihood $\log P(\mathbf{x}|y)$ for each point in X_test in 
<b><code>loglikelihood</code></b>. Think carefully how you would use the Na&iuml;ve Bayes assumption to calculate the likelihood and how you can use the fact $\log(ab) = \log a + \log b$ to simplify your calculations.
</p> 

In [None]:
def loglikelihood(posprob, negprob, X_test, Y_test):
    """
    loglikelihood(posprob, negprob, X_test, Y_test) returns loglikelihood of each point in X_test
    
    Input:
        posprob: conditional probabilities for the positive class (d)
        negprob: conditional probabilities for the negative class (d)
        X_test : features (nxd)
        Y_test : labels (-1 or +1) (n)
    
    Output:
        loglikelihood of each point in X_test (n)
    """
    n, d = X_test.shape
    loglikelihood = np.zeros(n)
    
    ### BEGIN SOLUTION
    pos_ind = (Y_test == 1)
    loglikelihood[pos_ind] = X_test[pos_ind]@np.log(posprob) + (1 - X_test[pos_ind])@np.log(1 - posprob)
    neg_ind = (Y_test == -1)
    loglikelihood[neg_ind] = X_test[neg_ind]@np.log(negprob) + (1 - X_test[neg_ind])@np.log(1 - negprob)
    ### END SOLUTION
    return loglikelihood
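
<p>To see why the sum of logs is the right quantity, the sketch below compares it with the log of the product of per-feature terms for a single point. Under the Na&iuml;ve Bayes assumption the likelihood factorizes over features, so the two values agree. All numbers are made up for illustration.</p>

```python
import numpy as np

# Made-up values for illustration only.
prob = np.array([0.5, 0.25, 0.8])   # P(x_alpha = 1 | y) for three features
x = np.array([1, 0, 1])             # one binary feature vector

# Product of per-feature terms (the Naive Bayes assumption).
likelihood = np.prod(np.where(x == 1, prob, 1 - prob))

# The same quantity computed as a sum of logs.
log_likelihood = x @ np.log(prob) + (1 - x) @ np.log(1 - prob)

print(np.isclose(np.log(likelihood), log_likelihood))  # True
```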

<p>(c) Observe that we should classify a test point $\mathbf{x}_{test}$ as positive if the log ratio $\log\left(\frac{P(y=1 | \mathbf{x} = \mathbf{x}_{test})}{P(y=-1|\mathbf{x} = \mathbf{x}_{test})}\right) > 0$ and as negative otherwise. Implement <b><code>naivebayes_pred</code></b> by first calculating this log ratio for each test point in <code>X_test</code> using Bayes' rule, then predicting the label of each test point from the sign of the log ratio. When calculating the log ratio, think carefully about how you can use the fact $\log \left(\frac{a}{b}\right) = \log{a} - \log{b}$ to simplify your calculations.
</p>



In [None]:
#<GRADED>
def naivebayes_pred(pos, neg, posprob, negprob, X_test):
    """
    naivebayes_pred(pos, neg, posprob, negprob, X_test) returns the prediction of each point in X_test
    
    Input:
        pos: class probability for the positive class
        neg: class probability for the negative class
        posprob: conditional probabilities for the positive class (d)
        negprob: conditional probabilities for the negative class (d)
        X_test : features (nxd)
    
    Output:
        prediction of each point in X_test (n)
    """
    n, d = X_test.shape
    
    ### BEGIN SOLUTION
    loglikelihood_ratio = loglikelihood(posprob, negprob, X_test, np.ones(n)) - \
        loglikelihood(posprob, negprob, X_test, -np.ones(n)) + np.log(pos) - np.log(neg)
    preds = - np.ones(n)
    preds[loglikelihood_ratio > 0] = 1
    return preds
    ### END SOLUTION
    
#</GRADED>
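
<p>Because the features are binary, the log ratio can be rearranged into the form $\mathbf{w}^\top \mathbf{x} + b$, which is why Na&iuml;ve Bayes is a linear classifier here. The following is a minimal sketch of that rearrangement, assuming all probabilities lie strictly between 0 and 1; <code>naivebayes_linear</code> is a hypothetical helper and the parameter values are made up.</p>

```python
import numpy as np

def naivebayes_linear(pos, neg, posprob, negprob):
    """Hypothetical helper: weights w and bias b such that w @ x + b is the log ratio."""
    w = np.log(posprob) - np.log(negprob) - np.log(1 - posprob) + np.log(1 - negprob)
    b = np.sum(np.log(1 - posprob) - np.log(1 - negprob)) + np.log(pos) - np.log(neg)
    return w, b

# Made-up parameters and test points for illustration.
pos, neg = 0.6, 0.4
posprob = np.array([0.7, 0.2, 0.5])
negprob = np.array([0.3, 0.6, 0.5])
X_toy = np.array([[1, 0, 1],
                  [0, 1, 0]])

w, b = naivebayes_linear(pos, neg, posprob, negprob)
preds = np.where(X_toy @ w + b > 0, 1, -1)
print(preds)  # [ 1 -1]
```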

You can now test your code with the following interactive name classification script:

In [None]:
DIMS = 128
print('Loading data ...')
X,Y = genTrainFeatures(DIMS)
print('Training classifier ...')
pos, neg = naivebayesPY(X, Y)
posprob, negprob = naivebayesPXY(X, Y)
error = np.mean(naivebayes_pred(pos, neg, posprob, negprob, X) != Y)
print('Training error: %.2f%%' % (100 * error))

while True:
    print('Please enter your name>')
    yourname = input()
    if len(yourname) < 1:
        break
    xtest = name2features(yourname,B=DIMS,LoadFile=False)
    pred = naivebayes_pred(pos, neg, posprob, negprob, xtest)
    if pred > 0:
        print("%s, I am sure you are a nice boy.\n" % yourname)
    else:
        print("%s, I am sure you are a nice girl.\n" % yourname)

Loading data ...
Training classifier ...
Training error: 22.83%
Please enter your name>
Paul
Paul, I am sure you are a nice boy.

Please enter your name>
Emily
Emily, I am sure you are a nice boy.

Please enter your name>
Lara
Lara, I am sure you are a nice girl.

Please enter your name>
Emilee
Emilee, I am sure you are a nice girl.

Please enter your name>
Tamy
Tamy, I am sure you are a nice girl.

Please enter your name>
Kilian
Kilian, I am sure you are a nice boy.

Please enter your name>
Cheng
Cheng, I am sure you are a nice boy.

Please enter your name>
Carol
Carol, I am sure you are a nice boy.

Please enter your name>
Tony
Tony, I am sure you are a nice boy.

Please enter your name>
Pauline
Pauline, I am sure you are a nice girl.

Please enter your name>


<h3> Feature Extraction (Competition)</h3>

<p>(d) (<b>optional</b>) As always, this programming project also includes a competition. We will rank all submissions by how well your Na&iuml;ve Bayes classifier performs on a secret test set. If you want to improve your classifier, modify <code>name2features2</code> below. The autograder will use your Python script to extract features and train your classifier on the same training set of names by calling the function with only one argument: the name of a file containing a list of names. The given implementation is a simple baseline that hashes each letter of the name into one feature.
</p>
  

In [None]:
# Named hashfeatures2 so that it does not overwrite the three-argument
# hashfeatures used by name2features above.
def hashfeatures2(baby, B):
    v = np.zeros(B)
    # set one hashed feature per letter of the name
    for letter in baby:
        v[hash(letter) % B] = 1
    return v

def name2features2(filename, B=128, LoadFile=True):
    """
    Output:
        X : n feature vectors of dimension B, (nxB)
    """
    # read in baby names
    if LoadFile:
        with open(filename, 'r') as f:
            babynames = [x.rstrip() for x in f.readlines() if len(x) > 0]
    else:
        babynames = filename.split('\n')
    n = len(babynames)
    X = np.zeros((n, B))
    for i in range(n):
        X[i,:] = hashfeatures2(babynames[i], B)
    return X
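
<p>As one possible starting point for the competition, the sketch below hashes pairs of adjacent letters instead of single letters. It is only an illustration of the kind of change you could try, with no claim that it improves accuracy; <code>hashfeatures_bigrams</code> and the <code>"bigram"</code> tag are hypothetical names.</p>

```python
import numpy as np

def hashfeatures_bigrams(baby, B=128):
    """Hypothetical variant: hash every pair of adjacent letters into B buckets."""
    v = np.zeros(B)
    name = baby.lower()
    for first, second in zip(name, name[1:]):
        v[hash("bigram" + first + second) % B] = 1
    return v

v = hashfeatures_bigrams("Sophia")
print(v.shape, int(v.sum()))
```

<p>Keep in mind that Python's built-in <code>hash</code> for strings is randomized per process unless <code>PYTHONHASHSEED</code> is set, so bucket indices are only comparable within a single run.</p>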

<h4>Credits</h4>
 The name classification idea originates from <a href="http://nickm.com">Nick Montfort</a>.

In [4]:
### BEGIN HIDDEN TESTS
# Instructor's internal Code
boy_train_file =  "boys.train"
girl_train_file = "girls.train"
boy_test_file = "boys.test"
girl_test_file = "girls.test"

def hashfeatures_grader(baby, B, FIX):
    v = np.zeros(B)
    for m in range(FIX):
        featurestring = "prefix" + baby[:m]
        v[hash(featurestring) % B] = 1
        featurestring = "suffix" + baby[-1*m:]
        v[hash(featurestring) % B] = 1
    return v

def name2features_grader(filename, B=128, FIX=3, LoadFile=True):
    """
    Output:
    X : n feature vectors of dimension B, (nxB)
    """
    # read in baby names
    if LoadFile:
        with open(filename, 'r') as f:
            babynames = [x.rstrip() for x in f.readlines() if len(x) > 0]
    else:
        babynames = filename.split('\n')
    n = len(babynames)
    X = np.zeros((n, B))
    for i in range(n):
        X[i,:] = hashfeatures_grader(babynames[i], B, FIX)
    return X

def genTrainFeatures_grader(dimension=128, fix=3, g=girl_train_file, b=boy_train_file ):
    Xgirls = name2features_grader(g, B=dimension, FIX=fix)
    Xboys = name2features_grader(b, B=dimension, FIX=fix)
    X = np.concatenate([Xgirls, Xboys])

    Y = np.concatenate([-np.ones(len(Xgirls)), np.ones(len(Xboys))])

    ii = np.random.permutation([i for i in range(len(Y))])

    return X[ii, :], Y[ii]

def analyze_grader(kind,truth,preds):
    truth = truth.flatten()
    preds = preds.flatten()
    if kind == 'abs':
        # compute the absolute difference between truth and predictions
        output = np.sum(np.abs(truth - preds)) / float(len(truth))
    elif kind == 'acc':
        if len(truth) == 0 and len(preds) == 0:
            output = 0
        else:
            output = np.sum(truth == preds) / float(len(truth))
    return output

## instructor code for testing ##

def naivebayesPY_grader(x,y):
    y = np.concatenate([y, [-1,1]])
    n = len(y)
    pos = np.mean(y == 1)
    neg = np.mean(y == -1)
    return pos, neg

def naivebayesPXY_grader(x,y):
    n, d = x.shape
    x = np.concatenate([x, np.ones((2,d)), np.zeros((2, d))])
    y = np.concatenate([y, [-1,1,-1,1]])
    n, d = x.shape
    posprob = np.mean(x[y == 1], axis=0)
    negprob = np.mean(x[y == -1], axis=0)
    return posprob, negprob

def loglikelihood_grader(posprob, negprob, x_test, y_test):
    # calculate the likelihood of each of the point in x_test log P(x | y)
    n, d = x_test.shape
    loglikelihood = np.zeros(n)
    pos_ind = (y_test == 1)
    loglikelihood[pos_ind] = x_test[pos_ind]@np.log(posprob) + (1 - x_test[pos_ind])@np.log(1 - posprob)
    neg_ind = (y_test == -1)
    loglikelihood[neg_ind] = x_test[neg_ind]@np.log(negprob) + (1 - x_test[neg_ind])@np.log(1 - negprob)
    
    return loglikelihood

def naivebayes_pred_grader(pos, neg, posprob, negprob, x_test):
    n, d = x_test.shape 
    #     raise NotImplementedError
    loglikelihood_ratio = loglikelihood_grader(posprob, negprob, x_test, np.ones(n)) - \
        loglikelihood_grader(posprob, negprob, x_test, -np.ones(n)) + np.log(pos) - np.log(neg)
    preds = - np.ones(n)
    preds[loglikelihood_ratio > 0] = 1
    return preds

X,Y = genTrainFeatures_grader(128)
posY, negY = naivebayesPY_grader(X, Y)
posprobXY, negprobXY = naivebayesPXY_grader(X, Y)
### END HIDDEN TESTS

In [6]:
### BEGIN HIDDEN TESTS
# Check if probabilities sum to 1
def naivebayesPY_test1():
    pos,neg = naivebayesPY(X,Y)
    return np.linalg.norm(pos + neg - 1) < 1e-5

# Test the Naive Bayes function on a simple matrix
def naivebayesPY_test2():
    x = np.array([[0,1],[1,0]])
    y = np.array([-1,1])
    pos, neg = naivebayesPY(x,y)
    pos0, neg0 = .5, .5
    test = np.linalg.norm(pos - pos0) + np.linalg.norm(neg - neg0)
    return test < 1e-5

# Test the Naive Bayes on a longer non-square matrix
def naivebayesPY_test3():
    x = np.array([[0,1,1,0,1],
        [1,0,0,1,0],
        [1,1,1,1,0],
        [0,1,1,0,1],
        [1,0,1,0,0],
        [0,0,1,0,0],
        [1,1,1,0,1]])    
    y = np.array([1,-1, 1, 1,-1,-1, 1])
    pos, neg = naivebayesPY(x,y)
    pos0, neg0 = 5/9., 4/9.
    test = np.linalg.norm(pos - pos0) + np.linalg.norm(neg - neg0)
    return test < 1e-5

# Tests that student didn't delete plus-one smoothing
def naivebayesPY_test4():
    x = np.array([[0,1,1,0,1],
        [1,0,0,1,0]])    
    y = np.array([1,1])
    pos, neg = naivebayesPY(x,y)
    pos0, neg0 = 3/4., 1/4.
    test = np.linalg.norm(pos - pos0) + np.linalg.norm(neg - neg0)
    return test < 1e-5

assert naivebayesPY_test1(), "naivebayesPY test failed - probabilities not adding to 1\n"
assert naivebayesPY_test2(), "naivebayesPY test failed - calculation of P(Y) seems incorrect\n" 
assert naivebayesPY_test3(), "naivebayesPY test failed - calculation of P(Y) seems incorrect\n"
assert naivebayesPY_test4(), "naivebayesPY test failed - did you do plus one smoothing?\n"
### END HIDDEN TESTS
