# Module 12 - Programming Assignment

## Directions

There are general instructions on Blackboard and in the Syllabus for Programming Assignments. This Notebook also has instructions specific to this assignment. Read all the instructions carefully and make sure you understand them. Please ask questions on the discussion boards or email me at `EN605.445@gmail.com` if you do not understand something.

<div style="background: mistyrose; color: firebrick; border: 2px solid darkred; padding: 5px; margin: 10px;">
You must follow the directions *exactly* or you will get a 0 on the assignment.
</div>

You must submit a zip file of your assignment and associated files (if there are any) to Blackboard. The zip file will be named after your JHED ID: `<jhed_id>.zip`. It will not include any other information. Inside this zip file should be the following directory structure:

```
<jhed_id>
    |
    +--module-01-programming.ipynb
    +--module-01-programming.html
    +--(any other files)
```

For example, do not name your directory `programming_assignment_01` and do not name your directory `smith122_pr1` or anything else. It must be only your JHED ID. Make sure you submit both an .ipynb and .html version of your *completed* notebook. You can generate the HTML version using:

> ipython nbconvert [notebookname].ipynb

or use the File menu.

In [1]:
from __future__ import division
import numpy as np
import csv

## Naive Bayes Classifier

In this assignment you will be using the mushroom data from the Decision Tree module:

http://archive.ics.uci.edu/ml/datasets/Mushroom

The assignment is to write a program that will learn and apply a Naive Bayes Classifier for this problem. You'll first need to calculate all of the necessary probabilities (don't forget to use +1 smoothing) using a `learn` function. You'll then need to have a `classify` function that takes your probabilities and a List of instances (possibly a list of 1) and returns a List of Tuples. Each Tuple is a class and the *normalized* probability of that class. The List should be sorted so that the probabilities are in descending order. For example,

```
[("e", 0.98), ("p", 0.02)]
```

When calculating the error rate of your classifier, you should pick the class with the highest probability (the first one in the list).

As a reminder, the Naive Bayes Classifier generates the un-normalized probabilities from the numerator of Bayes Rule:

$$P(C|A) \propto P(A|C)P(C)$$

where C is the class and A are the attributes (data). Since the normalizer of Bayes Rule is the *sum* of all possible numerators and you have to calculate them all, the normalizer is just the sum of the probabilities.
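
The normalization step described above can be sketched as follows. This is a minimal illustration, not a required implementation: the function name `normalize` and the example numerators are assumptions made up for this sketch.

```python
def normalize(scores):
    """Turn {class: un-normalized score} into a list of (class, probability)
    tuples, sorted so the most probable class comes first."""
    total = float(sum(scores.values()))  # the normalizer is the sum of all numerators
    normalized = [(label, score / total) for label, score in scores.items()]
    return sorted(normalized, key=lambda pair: pair[1], reverse=True)

# Hypothetical numerators P(A|C)P(C) for a single instance.
result = normalize({"e": 0.0049, "p": 0.0001})
best_class = result[0][0]  # the class used when computing the error rate
```

Because the list is sorted in descending order, the predicted class is always the first element of the first tuple.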

You'll also need an `evaluate` function as before. You should use the $error\_rate$ again.
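
One possible shape for `evaluate` is sketched below. It assumes (this is not specified above) that `classify` returns one sorted list of `(class, probability)` tuples per instance and that the true labels are available as a plain list; the example labels and predictions are invented.

```python
def evaluate(actual, predictions):
    """Error rate: the fraction of instances whose top-ranked class is wrong."""
    if len(actual) != len(predictions):
        raise ValueError("actual and predictions must have the same length")
    predicted = [ranked[0][0] for ranked in predictions]  # first tuple = best class
    wrong = sum(1 for a, p in zip(actual, predicted) if a != p)
    return wrong / float(len(actual))

# Invented example: four instances, the second one is misclassified.
actual = ["e", "p", "e", "p"]
predictions = [
    [("e", 0.98), ("p", 0.02)],
    [("e", 0.60), ("p", 0.40)],
    [("e", 0.75), ("p", 0.25)],
    [("p", 0.90), ("e", 0.10)],
]
rate = evaluate(actual, predictions)
```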

Use the same testing procedure as last time, on two randomized subsets of the data:

1. learn the probabilities for set 1
2. classify set 2
3. evaluate the predictions
4. learn the probabilities for set 2
5. classify set 1
6. evaluate the predictions
7. average the classification error.
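
The seven steps above amount to a two-fold swap, which could be sketched as below. The name `cross_validate` and the call signatures `learn(data)`, `classify(probabilities, data)`, and `evaluate(data, predictions)` are assumptions for illustration; adapt them to whatever signatures your own functions use.

```python
def cross_validate(set1, set2, learn, classify, evaluate):
    """Train on one subset, test on the other, swap, and average the error."""
    errors = []
    for train, test in ((set1, set2), (set2, set1)):
        probabilities = learn(train)                 # steps 1 and 4
        predictions = classify(probabilities, test)  # steps 2 and 5
        errors.append(evaluate(test, predictions))   # steps 3 and 6
    return sum(errors) / float(len(errors))          # step 7
```

Passing the three functions in as arguments keeps the procedure independent of how the probabilities are stored.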

-----

### Encoding Metadata ###

In [135]:
featureVals = {
    "cap-shape": set(["b","c","x","f","k","s"]),
    "cap-surface": set(["f","g","y","s"]),
    "cap-color": set(["n","b","c","g","r","p","u","e","w","y"]),
    "bruises": set(["t","f"]),
    "odor": set(["a","l","c","y","f","m","n","p","s"]),
    "gill-attachment": set(["a","d","f","n"]),
    "gill-spacing": set(["c","w","d"]),
    "gill-size": set(["b","n"]),
    "gill-color": set(["k","n","b","h","g","r","o","p","u","e","w","y"]),
    "stalk-shape": set(["e","t"]),
    "stalk-root": set(["b","c","u","e","z","r","?"]),
    "stalk-surface-above-ring": set(["f","y","k","s"]),
    "stalk-surface-below-ring": set(["f","y","k","s"]),
    "stalk-color-above-ring": set(["n","b","c","g","o","p","e","w","y"]),
    "stalk-color-below-ring": set(["n","b","c","g","o","p","e","w","y"]),
    "veil-type": set(["p","u"]),
    "veil-color": set(["n","o","w","y"]),
    "ring-number": set(["n","o","t"]),
    "ring-type": set(["c","e","f","l","n","p","s","z"]),
    "spore-print-color": set(["k","n","b","h","r","o","u","w","y"]),
    "population": set(["a","c","n","s","v","y"]),
    "habitat": set(["g","l","m","p","u","w","d"])
}

headers = ['class', 'cap-shape', 'cap-surface', 'cap-color' , 'bruises', 'odor', 
           'gill-attachment' , 'gill-spacing', 'gill-size', 'gill-color' , 'stalk-shape',
           'stalk-root', 'stalk-surface-above-ring' , 'stalk-surface-below-ring',
           'stalk-color-above-ring', 'stalk-color-below-ring' , 'veil-type', 'veil-color',
           'ring-number' , 'ring-type', 'spore-print-color', 'population', 'habitat']

### Reading in Data ###

In [125]:
def readCSV(filePath):
    with open(filePath, 'rt') as f:
        reader = csv.reader(f)
        l = list(reader)
    return l

raw = np.rec.fromrecords( readCSV('agaricus-lepiota.data') , names=headers)

idx = np.arange(raw.size) # an array (not a range object) so it can be shuffled in place
np.random.shuffle(idx) # shuffle the indices
data1 = raw[ idx[:int(np.floor(raw.size/2))] ] # first half of data
data2 = raw[ idx[int(np.floor(raw.size/2)):] ] # second half of data

In [126]:
def countProb(pr):
    return (pr + 1) / (np.sum(pr)+1)

In [137]:
def learn(data):
    pr = dict()
    classes = np.unique(data['class'])
    for c in classes:
        idx = data['class'] == c # boolean mask selecting the rows of class c
        pr[c] = condProb(data[idx], np.sum(idx))

    return pr

################################################################################
def condProb(data, N): # the data is assumed to be class-homogenous
    features = set(headers) - set(['class'])
    condPr = dict()
    for f in features:
        try:
            values = featureVals[f]
        except KeyError:
            print( "no known values for feature: " + repr(f) )
            raise
        denom = N + len(values) # +1 smoothing: one pseudo-count per possible value
        tmp = dict([(v, 1/denom) for v in values]) # values unseen in this class
        vals, vN = np.unique(data[f], return_counts=True)
        vPr = (vN+1) / denom
        for i,v in enumerate(vals):
            tmp[v] = vPr[i]

        condPr[f] = tmp

    return condPr
    
    
x = learn(data1)
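
The nested dictionary returned by `learn` is indexed as `pr[class][feature][value]`. A minimal sketch of how such a table might be used to compute the un-normalized score for one instance follows. The `priors` dictionary, the toy probabilities, and the name `score_instance` are assumptions for illustration, since `learn` as written does not return class priors.

```python
def score_instance(instance, cond_probs, priors):
    """Un-normalized P(A|C)P(C) for each class, given one instance as
    a {feature: value} dictionary."""
    scores = {}
    for label, prior in priors.items():
        score = prior
        for feature, value in instance.items():
            score *= cond_probs[label][feature][value]  # naive independence assumption
        scores[label] = score
    return scores

# Toy table with a single feature; the numbers are invented.
toy_cond_probs = {
    "e": {"odor": {"n": 0.8, "f": 0.2}},
    "p": {"odor": {"n": 0.1, "f": 0.9}},
}
toy_priors = {"e": 0.5, "p": 0.5}
scores = score_instance({"odor": "n"}, toy_cond_probs, toy_priors)
```

With all 22 features, the product of many small probabilities can underflow, so summing logarithms instead of multiplying is a common alternative.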

-----

Put your main function calls here.

In [120]:
sum(countProb(np.array([40, 100, 20])))

1.012422360248447
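
The sum above comes out to roughly 1.012 rather than 1 because `countProb` adds 1 to every count but only 1 to the denominator. A sketch of add-one smoothing that does sum to 1 adds the number of categories to the denominator instead; the name `smoothed_prob` is hypothetical.

```python
import numpy as np

def smoothed_prob(counts):
    """Add-one smoothing: add 1 to each count and the number of
    categories to the total, so the result is a valid distribution."""
    counts = np.asarray(counts, dtype=float)
    return (counts + 1) / (counts.sum() + counts.size)

probs = smoothed_prob([40, 100, 20])
total = probs.sum()  # equals 1 up to floating-point rounding
```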