# Module 11 - Programming Assignment

## Directions

There are general instructions on Blackboard and in the Syllabus for Programming Assignments. This Notebook also has instructions specific to this assignment. Read all the instructions carefully and make sure you understand them. Please ask questions on the discussion boards or email me at `EN605.445@gmail.com` if you do not understand something.

<div style="background: mistyrose; color: firebrick; border: 2px solid darkred; padding: 5px; margin: 10px;">
You must follow the directions *exactly* or you will get a 0 on the assignment.
</div>

You must submit a zip file of your assignment and associated files (if there are any) to Blackboard. The zip file will be named after your JHED ID: `<jhed_id>.zip`. It will not include any other information. Inside this zip file should be the following directory structure:

```
<jhed_id>
    |
    +--module-01-programming.ipynb
    +--module-01-programming.html
    +--(any other files)
```

For example, do not name your directory `programming_assignment_01` and do not name your directory `smith122_pr1` or anything else. It must be only your JHED ID. Make sure you submit both an .ipynb and .html version of your *completed* notebook. You can generate the HTML version using:

> ipython nbconvert [notebookname].ipynb

or use the File menu.

Add whatever additional imports you require here. Stick with the standard libraries and those required by the class. The `IPython.core.display` import gives you access to the functions documented at http://ipython.org/ipython-doc/stable/api/generated/IPython.core.display.html (copy this link), which, among other things, let you display HTML as the result of evaluated code (see HTML() or display_html()).

In [1]:
from __future__ import division # so that 1/2 = 0.5 and not 0
from IPython.core.display import *
import numpy as np
import copy, math, csv

## Decision Trees

For this assignment you will be implementing and evaluating a Decision Tree using the ID3 Algorithm (**no** pruning or normalized information gain). Use the provided pseudocode. The data is located at (copy link):

http://archive.ics.uci.edu/ml/datasets/Mushroom

You can download the two files and read them to find out the attributes, attribute values and class labels as well as their locations in the file.
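
ID3 chooses, at each node, the attribute with the highest information gain: the entropy of the class labels minus the weighted entropy that remains after splitting on that attribute. The following is a minimal sketch of that calculation on a made-up four-row table. The names `entropy`, `information_gain`, and `toy_rows` are illustrative assumptions, not part of the provided pseudocode.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(rows, column, label_column=-1):
    """Entropy of the labels minus the weighted entropy after splitting."""
    labels = [row[label_column] for row in rows]
    groups = {}
    for row in rows:
        # Collect the class labels that fall under each attribute value.
        groups.setdefault(row[column], []).append(row[label_column])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Hypothetical toy data: (odor, color, class). Not taken from the real file.
toy_rows = [
    ("foul", "brown", "p"),
    ("foul", "white", "p"),
    ("none", "brown", "e"),
    ("none", "white", "e"),
]

gain_odor = information_gain(toy_rows, 0)   # odor separates the classes perfectly
gain_color = information_gain(toy_rows, 1)  # color tells us nothing here
```

In this toy table, splitting on odor removes all uncertainty, so ID3 would pick it first.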

One of the things we did not talk about in the lectures was how to deal with missing values. In C4.5, missing values were handled by treating "?" as an implicit attribute value for every feature. For example, if the attribute was "size" then the domain would be ["small", "medium", "large", "?"]. Another approach is to skip instances with missing values. Yet another approach is to infer the missing value conditioned on the class. For example, if the class is "safe" and the color is missing, then we would infer the attribute value that is most often associated with "safe", perhaps "red". **Use the "?" approach for this assignment.**
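
As a rough illustration of the "?" approach, the sketch below treats "?" as just another value in an attribute's domain, so instances with a missing value get their own branch when the data is partitioned. The toy rows and the helper names `domain` and `partition` are assumptions made for this example only.

```python
# Hypothetical toy rows: (size, color, label); "?" marks a missing value.
rows = [
    ("small", "red", "safe"),
    ("?", "red", "safe"),
    ("large", "?", "unsafe"),
    ("?", "green", "unsafe"),
]

def domain(rows, column):
    """Return the sorted set of values seen in a column, "?" included."""
    return sorted({row[column] for row in rows})

def partition(rows, column):
    """Group rows by their value in a column; "?" gets its own group."""
    groups = {}
    for row in rows:
        groups.setdefault(row[column], []).append(row)
    return groups

size_domain = domain(rows, 0)
size_groups = partition(rows, 0)
```

No special-case code is needed: the two rows with a missing size end up together under the key `"?"`.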

As we did with the neural network, you should randomize your data (always randomize your data...you don't know if it is in some particular order like date of collection, by class label, etc.) and split it into two (2) sets. Train on the first set then test on the second set. Then train on the second set and test on the first set.
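
One possible way to shuffle and split is sketched below. It assumes the data is held in a NumPy array; the function name `two_fold_split` and the `seed` parameter are illustrative choices, not requirements.

```python
import numpy as np

def two_fold_split(data, seed=None):
    """Shuffle a copy of the data and split it into two halves."""
    rng = np.random.default_rng(seed)
    shuffled = data[rng.permutation(len(data))]
    half = len(shuffled) // 2  # integer index; the second half gets any extra row
    return shuffled[:half], shuffled[half:]

toy = np.arange(10)  # stand-in for the real instances
set1, set2 = two_fold_split(toy, seed=0)
```

Indexing with a permutation leaves the original array untouched, which makes it easier to rerun the experiment with a different seed.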

For regression, we almost always use something like Mean Squared Error to judge the performance of a model. For classification, there are a lot more options but for this assignment we will just look at classification error:

$$error\_rate=\frac{errors}{n}$$
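
As a small worked example of the formula, the sketch below counts mismatches between actual and predicted labels and divides by the number of instances. The function name `error_rate` and the label lists are made up for illustration.

```python
def error_rate(actual, predicted):
    """Fraction of instances whose predicted label differs from the actual one."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    errors = sum(1 for a, p in zip(actual, predicted) if a != p)
    return errors / len(actual)

actual = ["e", "p", "e", "p"]
predicted = ["e", "p", "p", "p"]
rate = error_rate(actual, predicted)  # one mismatch out of four instances
```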

You must implement four functions. `train` takes training_data and returns the Decision Tree as a data structure or object (for this one, I'm removing the OOP restriction...people often feel more comfortable writing a Tree in an OOP fashion). Make sure your Tree can be represented somehow.

```
def train(training_data):
    # returns a decision tree data structure
    pass
```

and `view` takes a tree and prints it out:

```
def view( tree):
    pass # probably doesn't return anything.
```

The purpose of `view` is to let you see what the tree looks like. The output should be legible and pretty. You can use ASCII if you like or use something like NetworkX.

and `classify` takes a tree and a List of instances (possibly just one) and returns the classifications:

```
def classify(tree, test_data):
    # returns a list of classifications
    pass
```

and `evaluate` takes the classifications and the test_data and returns the error rate:

```
def evaluate(test_data, classifications):
    # returns an error rate
    pass
```

Basically, you're going to:

1. learn the tree for set 1
2. view the tree
3. classify set 2
4. evaluate the tree
5. learn the tree for set 2
6. view the tree
7. classify set 1
8. evaluate the tree
9. average the classification error.

This is all that is required for this assignment. I'm leaving more of the particulars up to you but you can definitely use the last module as a guide.
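
The nine steps above can be sketched as a single loop over the two train/test pairings. This is only an outline under stated assumptions: `two_fold_error` is a hypothetical helper, and the majority-label "tree" below is a stand-in so the example runs on its own. It is not the ID3 implementation the assignment asks for.

```python
def two_fold_error(set1, set2, train, classify, evaluate, view=None):
    """Train on one set, test on the other, swap, and average the two error rates."""
    fold_errors = []
    for train_set, test_set in ((set1, set2), (set2, set1)):
        tree = train(train_set)
        if view is not None:
            view(tree)
        predictions = classify(tree, test_set)
        fold_errors.append(evaluate(test_set, predictions))
    return sum(fold_errors) / len(fold_errors)

# Hypothetical stand-ins: the "tree" is just the majority class label.
def majority_train(rows):
    labels = [label for _, label in rows]
    return max(set(labels), key=labels.count)

def majority_classify(tree, rows):
    return [tree for _ in rows]

def label_error(rows, predictions):
    wrong = sum(1 for (_, label), p in zip(rows, predictions) if label != p)
    return wrong / len(rows)

fold_a = [("x", "e"), ("y", "e"), ("z", "p")]
fold_b = [("x", "e"), ("y", "p"), ("z", "p")]
average_error = two_fold_error(fold_a, fold_b,
                               majority_train, majority_classify, label_error)
```

Passing the four functions in as arguments keeps the experiment loop independent of how the tree itself is represented.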

**This is a very important assignment to reflect on the use of deepcopy because it has a natural recursive implementation**
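
To see why copying matters in a recursive tree builder, consider the sketch below. Each call consumes one attribute from a list. If sibling branches share the same list object, the first branch uses up attributes that the second branch still needs. The function `build` and its `isolate` flag are hypothetical and exist only to contrast the two behaviours; the splitting rule is deliberately trivial.

```python
import copy

def build(attributes, isolate):
    """Build a nested dict that splits on each attribute in list order."""
    if not attributes:
        return "leaf"
    # With isolate=True this call works on its own copy of the list;
    # with isolate=False it mutates the list shared with its caller.
    remaining = copy.deepcopy(attributes) if isolate else attributes
    best = remaining.pop(0)
    return {best: {value: build(remaining, isolate) for value in ("yes", "no")}}

attrs = ["odor", "color"]
safe_tree = build(attrs, isolate=True)

shared = ["odor", "color"]
broken_tree = build(shared, isolate=False)
```

With copying, both branches under `odor` go on to split on `color` and `attrs` is left unchanged. Without it, the `"yes"` branch empties the shared list, so the `"no"` branch collapses straight to a leaf. For a flat list of strings a shallow copy would also be enough; `deepcopy` is the safer default once the structure being passed down contains nested mutable objects.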

-----

### Decision Tree class ###

In [2]:
class DTnode:
    def __init__(self, attrib):
        self.attrib = attrib
        self.isLeaf = True
        self.children = dict()

    def addChild(self, node, val):
        self.isLeaf = False
        self.children[val] = node
        
    def getChild(self, val):
        return self.children[val]
    
    def getValues(self):
        return set(self.children.keys())

    def __str__(self):
        if self.isLeaf:
            childTxt = 'terminal'
        else:
            childTxt = 'child: ' + str(list(self.children.keys()))
        return '[Node for %s, %s ]'%(self.attrib, childTxt)

In [3]:
################################################################################
def Hlog2(pr):
    out = np.zeros(pr.shape)
    z = np.logical_or(pr==0,pr==1)
    out[~z] = np.log2(pr[~z])
    return out

def H(d, A):
    vals, N = np.unique(d[A], return_counts=True)
    p = N/d.size
    return -np.sum( p*Hlog2(p) )

def B(p):
    return 0 if p==1 or p==0 else -(p*np.log2(p) + (1-p)*np.log2(1-p))

def edibleProb(d):
    return np.count_nonzero(d['class']=='e') / d.size
    
def Remainder(d, Attr):
    vals,N = np.unique(d[Attr], return_counts=True )
    entropies = [B(edibleProb(d[d[Attr]==v])) for v in vals]
    return np.sum(N/d.size * entropies)

def Importance(d, attribs):
    entropy = B(edibleProb(d))
    #print(entropy)
    gains = list(zip(*[(entropy - Remainder(d,A),A) for A in attribs]))
    return gains[1][np.argmax(gains[0])], np.max(gains[0])

def readCSV(filePath):
    with open(filePath, 'rt') as f:
        reader = csv.reader(f)
        l = list(reader)
    return l

In [4]:
def id3(d, attribs, default):
    if d.size == 0: # empty data
        return DTnode(default)
    prEd = edibleProb(d)
    majority = 'e' if prEd > 0.5 else 'p'
    if prEd in [0,1] or len(attribs)==0: # homogenous or no attribs left
        return DTnode(majority) 
    
    bestAttr, gain = Importance(d, attribs)
    attribSubset = attribs - set([bestAttr])
    nd = DTnode(bestAttr) # new node at the best attribute
    for v in set(d[bestAttr]):
        child = id3(d[d[bestAttr]==v], attribSubset, majority)
        nd.addChild(child, v)
    return nd

def train(data):
    return id3(data, attribSet, 'e' if edibleProb(data) > 0.5 else 'p')

In [5]:
def view(tree):
    def toStr(nd, level=0):
        if nd.isLeaf:
            return 'class: %s\n' % nd.attrib
        else:
            ret = 'Attribute [' + nd.attrib + "]:\n"
            nx = level + 1
            for key in nd.children:
                ret += "\t"*nx + 'val=%s, '%key + toStr(nd.children[key],nx)
            return ret
    
    print(toStr(tree))

In [6]:
def classify(tree, data):
    def recurClass(d, nd, idx, res):
        # idx is a boolean mask selecting the rows that reach node nd
        if nd.isLeaf:
            res[idx] = nd.attrib
        else:
            for k in nd.children:
                recurClass(d, nd.children[k],
                           np.logical_and(idx, d[nd.attrib]==k), res)
        return res

    ind = np.array([True] * data.size)
    out = np.array([None] * data.size) # rows with unseen values stay None
    return recurClass(data, tree, ind, out)

In [7]:
def evaluate(data, classes):
    return sum(data['class'] != classes) / data.size

---

### Reading Data File ###

In [8]:
headers = ['class', 'cap-shape', 'cap-surface', 'cap-color' , 'bruises', 'odor', 
           'gill-attachment' , 'gill-spacing', 'gill-size', 'gill-color' , 'stalk-shape',
           'stalk-root', 'stalk-surface-above-ring' , 'stalk-surface-below-ring',
           'stalk-color-above-ring', 'stalk-color-below-ring' , 'veil-type', 'veil-color',
           'ring-number' , 'ring-type', 'spore-print-color', 'population', 'habitat']
attribSet = set(headers) - set(['class'])

raw = np.rec.fromrecords( readCSV('agaricus-lepiota.data') , names=headers)
np.random.shuffle(raw) # shuffle the data
half = raw.size // 2 # integer index (np.floor returns a float)
data1 = raw[:half] # first half of data
data2 = raw[half:] # second half of data

**Train Decision Tree 1 from set 1**

In [9]:
tree1 = train(data1)

In [10]:
view(tree1)

Attribute [odor]:
	val=p, class: p
	val=f, class: p
	val=n, Attribute [spore-print-color]:
		val=h, class: e
		val=w, Attribute [ring-number]:
			val=o, Attribute [stalk-surface-above-ring]:
				val=f, class: e
				val=k, class: p
				val=y, class: p
				val=s, Attribute [cap-color]:
					val=c, class: e
					val=w, class: p
					val=n, class: e
			val=t, class: e
		val=n, class: e
		val=y, class: e
		val=o, class: e
		val=r, class: p
		val=k, class: e
		val=b, class: e
	val=y, class: p
	val=s, class: p
	val=c, class: p
	val=a, class: e
	val=l, class: e
	val=m, class: p



**Classify set 2 with tree 1**

In [11]:
k2 = classify(tree1, data2)

---
**Train Decision Tree 2 from set 2**

In [12]:
tree2 = train(data2)

In [13]:
view(tree2)

Attribute [odor]:
	val=p, class: p
	val=f, class: p
	val=n, Attribute [spore-print-color]:
		val=h, class: e
		val=w, Attribute [habitat]:
			val=p, class: e
			val=w, class: e
			val=g, class: e
			val=l, Attribute [cap-color]:
				val=c, class: e
				val=w, class: p
				val=n, class: e
				val=y, class: p
			val=d, Attribute [population]:
				val=v, class: p
				val=y, class: e
		val=n, class: e
		val=y, class: e
		val=o, class: e
		val=r, class: p
		val=k, class: e
		val=b, class: e
	val=y, class: p
	val=s, class: p
	val=c, class: p
	val=a, class: e
	val=l, class: e
	val=m, class: p



**Classify set 1 with tree 2**

In [14]:
k1 = classify(tree2, data1)

---
**Evaluate average error rate**

In [15]:
e = evaluate(np.hstack((data1,data2)), np.hstack((k1,k2)))*100
print('Average error rate: %.2f%%' % e)

Average error rate: 0.00%
