# Module 12 - Programming Assignment

## Directions

There are general instructions on Blackboard and in the Syllabus for Programming Assignments. This Notebook also has instructions specific to this assignment. Read all the instructions carefully and make sure you understand them. Please ask questions on the discussion boards or email me at `EN605.445@gmail.com` if you do not understand something.

<div style="background: mistyrose; color: firebrick; border: 2px solid darkred; padding: 5px; margin: 10px;">
You must follow the directions *exactly* or you will get a 0 on the assignment.
</div>

You must submit a zip file of your assignment and associated files (if there are any) to Blackboard. The zip file will be named after your JHED ID: `<jhed_id>.zip`. It will not include any other information. Inside this zip file should be the following directory structure:

```
<jhed_id>
    |
    +--module-01-programming.ipynb
    +--module-01-programming.html
    +--(any other files)
```

For example, do not name your directory `programming_assignment_01` and do not name your directory `smith122_pr1` or anything else. It must be only your JHED ID. Make sure you submit both an .ipynb and .html version of your *completed* notebook. You can generate the HTML version using:

> ipython nbconvert [notebookname].ipynb

or use the File menu.

In [1]:
from __future__ import division
import numpy as np
import csv

## Naive Bayes Classifier

In this assignment you will be using the mushroom data from the Decision Tree module:

http://archive.ics.uci.edu/ml/datasets/Mushroom

The assignment is to write a program that will learn and apply a Naive Bayes Classifier for this problem. You'll first need to calculate all of the necessary probabilities (don't forget to use +1 smoothing) using a `learn` function. You'll then need to have a `classify` function that takes your probabilities, a List of instances (possibly a list of 1) and returns a List of Tuples. Each Tuple is a class and the *normalized* probability of that class. The List should be sorted so that the probabilities are in descending order. For example,

```
[("e", 0.98), ("p", 0.02)]
```

When calculating the error rate of your classifier, you should pick the class with the highest probability (the first one in the list).

As a reminder, the Naive Bayes Classifier generates the un-normalized probabilities from the numerator of Bayes Rule:

$$P(C|A) \propto P(A|C)P(C)$$

where C is the class and A are the attributes (data). Since the normalizer of Bayes Rule is the *sum* of all possible numerators and you have to calculate them all, the normalizer is just the sum of the probabilities.
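
As a minimal sketch of that normalization step, assuming made-up un-normalized numerators for the two classes (the numbers and the helper name `normalize` are illustrative and not part of the assignment):

```python
def normalize(scores):
    """Divide each un-normalized class score by the sum of all scores."""
    total = sum(scores.values())
    return {label: value / total for label, value in scores.items()}

# Hypothetical numerators P(A|C)P(C) for a single instance.
unnormalized = {"e": 0.0049, "p": 0.0001}
posterior = normalize(unnormalized)

for label, probability in posterior.items():
    print(label, round(probability, 2))
```

After dividing by the sum, the class probabilities for the instance add up to 1.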

You'll also need an `evaluate` function as before. You should use the $error\_rate$ again.

Use the same testing procedure as last time, on two randomized subsets of the data:

1. learn the probabilities for set 1
2. classify set 2
3. evaluate the predictions
4. learn the probabilities for set 2
5. classify set 1
6. evaluate the predictions
7. average the classification error.
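
The seven steps above can be sketched as a single helper. This is only an outline under stated assumptions: `two_fold_error` is a hypothetical name, and the majority-class learner, classifier, and error function below are toy stand-ins with simplified signatures, not the Naive Bayes functions the assignment asks for.

```python
from collections import Counter

def two_fold_error(set1, set2, learn, classify, evaluate):
    """Train on one subset, test on the other, swap, and average the errors."""
    model1 = learn(set1)                                  # step 1
    error_on_2 = evaluate(set2, classify(model1, set2))   # steps 2-3
    model2 = learn(set2)                                  # step 4
    error_on_1 = evaluate(set1, classify(model2, set1))   # steps 5-6
    return (error_on_1 + error_on_2) / 2                  # step 7

# Toy stand-ins: always predict the most common training label.
def majority_learn(rows):
    return Counter(label for label, _ in rows).most_common(1)[0][0]

def majority_classify(model, rows):
    return [model for _ in rows]

def error_rate(rows, predictions):
    wrong = sum(1 for (label, _), guess in zip(rows, predictions) if label != guess)
    return wrong / len(rows)

set1 = [("e", "x"), ("e", "b"), ("p", "x")]
set2 = [("p", "f"), ("p", "x"), ("e", "b"), ("p", "b")]
average_error = two_fold_error(set1, set2, majority_learn, majority_classify, error_rate)
print(average_error)
```

Passing the three functions as arguments keeps the procedure independent of any particular classifier.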

-----

### Encoding Metadata ###

Based on the `agaricus-lepiota.names` file, the metadata is encoded as follows. This is necessary because the training data may not contain every possible value of a feature, and because the cardinality of each feature is needed for probability smoothing. Both the sample space of the features and the column headers are encoded below.

The possible values of the features are encoded in a dictionary where the keys are the feature names and the values are sets of possible values. The column names are encoded in a list, in the same order as they appear in the data file.

In [2]:
featureVals = {
    "class": set(['p', 'e']),
    "cap-shape": set(["b","c","x","f","k","s"]),
    "cap-surface": set(["f","g","y","s"]),
    "cap-color": set(["n","b","c","g","r","p","u","e","w","y"]),
    "bruises": set(["t","f"]),
    "odor": set(["a","l","c","y","f","m","n","p","s"]),
    "gill-attachment": set(["a","d","f","n"]),
    "gill-spacing": set(["c","w","d"]),
    "gill-size": set(["b","n"]),
    "gill-color": set(["k","n","b","h","g","r","o","p","u","e","w","y"]),
    "stalk-shape": set(["e","t"]),
    "stalk-root": set(["b","c","u","e","z","r","?"]),
    "stalk-surface-above-ring": set(["f","y","k","s"]),
    "stalk-surface-below-ring": set(["f","y","k","s"]),
    "stalk-color-above-ring": set(["n","b","c","g","o","p","e","w","y"]),
    "stalk-color-below-ring": set(["n","b","c","g","o","p","e","w","y"]),
    "veil-type": set(["p","u"]),
    "veil-color": set(["n","o","w","y"]),
    "ring-number": set(["n","o","t"]),
    "ring-type": set(["c","e","f","l","n","p","s","z"]),
    "spore-print-color": set(["k","n","b","h","r","o","u","w","y"]),
    "population": set(["a","c","n","s","v","y"]),
    "habitat": set(["g","l","m","p","u","w","d"])
}

headers = ['class', 'cap-shape', 'cap-surface', 'cap-color' , 'bruises', 'odor', 
           'gill-attachment' , 'gill-spacing', 'gill-size', 'gill-color' , 'stalk-shape',
           'stalk-root', 'stalk-surface-above-ring' , 'stalk-surface-below-ring',
           'stalk-color-above-ring', 'stalk-color-below-ring' , 'veil-type', 'veil-color',
           'ring-number' , 'ring-type', 'spore-print-color', 'population', 'habitat']

### Reading in Data ###

Read the CSV file in as a numpy recarray whose field names are the headers encoded above. The order of the records is then randomized and the data is split into two subsets via slicing.

In [3]:
def readCSV(filePath):
    with open(filePath, 'rt') as f:
        reader = csv.reader(f)
        l = list(reader)
    return l

raw = np.rec.fromrecords( readCSV('agaricus-lepiota.data') , names=headers)

idx = np.arange(raw.size)
np.random.shuffle(idx) # shuffle the indices
half = raw.size // 2 # floor of half the number of records
data1 = raw[ idx[:half] ] # first half of data
data2 = raw[ idx[half:] ] # second half of data

### Learning Algorithm ###

**condProb(data, N, meta)**  
The function takes three inputs: the data as a recarray, an integer giving the number of records that have the current class value, and the metadata, which is a tuple containing the list of headers and the dictionary of feature values (in that order). The recarray is assumed to have homogeneous class values, since the function is used to calculate probabilities conditioned on a record having a specific class value.

The function iterates through all features (using the headers in the metadata parameter) and, for each feature, calculates the conditional probability as follows:

$$Pr(x_i|C=c)=\frac{\#(x_i,c)+1}{\#c+|x|+1}$$

where $|x|$ is the cardinality of the feature.

The conditional probabilities of each feature are stored in a dictionary whose keys are the feature values. The dictionary for each feature is in turn keyed by the feature name in an encompassing dictionary, which is returned.
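
A small worked example of the formula above, written in plain Python rather than numpy. The helper name `smoothed_conditionals` and the four observed values are assumptions made for illustration; the denominator follows the formula exactly as written in this notebook.

```python
from collections import Counter

def smoothed_conditionals(observed, possible_values):
    """Apply the formula above to one feature within one class."""
    n = len(observed)             # #c: number of records in this class
    card = len(possible_values)   # |x|: cardinality of the feature
    counts = Counter(observed)    # #(x_i, c); unseen values count as 0
    return {v: (counts[v] + 1) / (n + card + 1) for v in possible_values}

# Hypothetical gill-spacing values seen among four records of one class.
spacing = smoothed_conditionals(["c", "c", "w", "c"], ["c", "w", "d"])
print(spacing)
```

The value `d` never appears in the toy data, yet it still receives the non-zero probability 1/8, which is the purpose of smoothing. With the denominator as written, the three values sum to 7/8 rather than 1.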


**learn(data, meta)**  
The `learn` function takes two inputs: the data as a recarray, and the metadata, which is a tuple containing the list of headers and the dictionary of feature values (in that order).

The function iterates through all class values found in the data and, for each one, calculates the conditional probability of a record having a specific feature value given that it has that class value, using the function `condProb`. In each iteration the function slices the input data so that `condProb` only sees records of a single class.

The output is a dictionary with two keys, `prob` and `cond`. The value under `prob` is a dictionary mapping each class value to its smoothed unconditional probability. The value under `cond` is a dictionary keyed by class value; each entry is a dictionary keyed by feature name, whose values are dictionaries mapping each feature value to its probability conditioned on that class.
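
To make the nesting concrete, here is a hand-written dictionary in the shape the code indexes it, `pr['cond'][c][f][v]`. The probabilities are invented for illustration and cover a single feature; `cond_prob` is a hypothetical helper, not part of the notebook.

```python
# Hypothetical values; a real result would cover all 22 features.
pr = {
    "prob": {"e": 0.52, "p": 0.48},
    "cond": {
        "e": {"bruises": {"t": 0.65, "f": 0.35}},
        "p": {"bruises": {"t": 0.16, "f": 0.84}},
    },
}

def cond_prob(pr, c, feature, value):
    """Look up Pr(feature = value | class = c)."""
    return pr["cond"][c][feature][value]

print(pr["prob"]["e"])                       # unconditional probability of class "e"
print(cond_prob(pr, "p", "bruises", "f"))    # Pr(bruises = f | class = p)
```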



In [4]:
def condProb(data, N, meta): # the data is assumed to be class-homogeneous
    features = set(meta[0]) - set(['class']) # meta[0] is the list of headers
    featureVals = meta[1] # meta[1] maps feature name -> set of possible values
    condPr = dict()
    for f in features:
        card = len(featureVals[f]) # cardinality of the feature
        defProb = 1 / (N + card + 1) # smoothed probability of an unseen value
        tmp = dict([(v,defProb) for v in featureVals[f]]) # initialize base
        vals, vN = np.unique(data[f], return_counts=True)
        vPr = (vN + 1) / (N + card + 1) # smoothed probability of observed values
        for i,v in enumerate(vals):
            tmp[v] = vPr[i]
        condPr[f] = tmp

    return condPr

def learn(data, meta):
    pr = dict()
    classes,classN = np.unique(data['class'], return_counts=True )
    tmp = (classN+1) / (data.size + len(classes)+ 1) # smoothed class probabilities
    
    pr['prob'] = dict(zip(classes, tmp))
    pr['cond'] = dict()
    for c in classes:
        idx = data['class']==c # boolean mask selecting the records of class c
        pr['cond'][c] = condProb(data[idx], np.sum(idx), meta)
        
    return pr

### Classification algorithm ###

The function takes three inputs: the probabilities as calculated by the `learn(.)` function, the data to be classified as a numpy recarray, and the metadata of all feature values encoded as a dict.

The function calculates the un-normalized posterior $Pr(c|a) \propto Pr(c)\prod_{i}{Pr(a_i=v_i|c)}$ for each class value $c$ by iterating through the records and taking the product of the unconditional probability of the class value and the conditional probabilities of all the feature values observed in the record.

The posterior probabilities for each record are then normalized, sorted in descending order, and put into a list of lists of tuples containing both the class value and the normalized probability, a la `[("e", 0.98), ("p", 0.02)]`.
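
The computation for a single record can be traced by hand in plain Python. The sketch below assumes invented probabilities for only two features; `prior`, `cond`, and `instance` are illustrative names, and the numbers are not learned from the mushroom data.

```python
# Hypothetical probabilities for two features only.
prior = {"e": 0.5, "p": 0.5}
cond = {
    "e": {"odor": {"n": 0.8, "f": 0.2}, "bruises": {"t": 0.6, "f": 0.4}},
    "p": {"odor": {"n": 0.1, "f": 0.9}, "bruises": {"t": 0.2, "f": 0.8}},
}
instance = {"odor": "f", "bruises": "f"}

# Un-normalized posterior: class probability times each conditional probability.
scores = {}
for c in prior:
    score = prior[c]
    for feature, value in instance.items():
        score *= cond[c][feature][value]
    scores[c] = score

# Normalize by the sum over classes, then sort by descending probability.
total = sum(scores.values())
ranked = sorted(((c, s / total) for c, s in scores.items()),
                key=lambda pair: pair[1], reverse=True)
print(ranked)
```

Here the un-normalized scores are 0.04 for `e` and 0.36 for `p`, so `p` comes first with a normalized probability of about 0.9.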

In [5]:
def classify(pr, data, featureVals):
    features = [h for h in data.dtype.names if h != 'class']
    classes = np.array(list(featureVals['class']))
    
    # un-normalized posterior for every record (rows) and class (columns)
    tmp = np.zeros([data.size, len(classes)])
    for i,c in enumerate(classes):
        conds = pr['cond'][c]
        base = [pr['prob'][c]]
        for n in range(data.size):
            tmp[n][i] = np.prod(base + 
                                [conds[f][data[n][f]] for f in features])
            
    normed = (tmp.T / np.sum(tmp, axis=1)).T # normalized probability
    idx = np.argsort(-normed, axis=1) # index per row, sort by desc. prob
    sortedClass = classes[idx] # classes
    rows = np.arange(data.size)[:, None] # row indices as a column vector
    sortedProb = normed[rows, idx] # prob, for any number of classes
    
    return [[(a,b) for a,b in zip(k,p)] for k,p in zip(sortedClass, sortedProb)]

### Evaluation algorithm ###
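The function calculates the error rate as the percentage of instances whose predicted class does not match the actual class. Formally, it returns:
$$\frac{100}{n}\sum_{i=1}^{n}{I(k_i\ne\hat{k}_i)}$$
where $k_i$ is the actual class of record $i$, $\hat{k}_i$ is the predicted class, and $I$ is the indicator function.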

In [6]:
def evaluate(data, classes):
    act = data['class']
    pred = np.array(classes)[:,0,0]
    return np.sum(act!=pred) / data.size * 100
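
A toy check of the error-rate arithmetic, using hand-written predictions in the list-of-lists-of-tuples form returned by `classify`. The labels and probabilities are invented, and the snippet is plain Python that does not call `evaluate`.

```python
actual = ["e", "p", "p", "e"]
predictions = [
    [("e", 0.98), ("p", 0.02)],
    [("e", 0.60), ("p", 0.40)],   # wrong: the actual class is "p"
    [("p", 0.91), ("e", 0.09)],
    [("e", 0.75), ("p", 0.25)],
]

# The predicted class is the first tuple of each sorted list.
predicted = [ranked[0][0] for ranked in predictions]
errors = sum(1 for a, p in zip(actual, predicted) if a != p)
error_rate_percent = 100 * errors / len(actual)
print(error_rate_percent)  # 25.0
```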

-----

### Classification and Evaluation ###

Learning the probabilities from set 1 and classifying set 2, and vice versa

In [7]:
meta = (headers, featureVals)
prob1 = learn(data1, meta)
pred2 = classify(prob1, data2, featureVals)
prob2 = learn(data2, meta)
pred1 = classify(prob2, data1, featureVals)

Calculating the overall classification error rate

In [8]:
err = evaluate( np.hstack([data1,data2]), np.vstack([pred1,pred2]) )
print('Average error rate: %.2f%%' % err)

Average error rate: 5.11%
