# Module 8 - Programming Assignment

## Directions

1. Change the name of this file to be your JHED id as in `jsmith299.ipynb`. Be sure you use your JHED ID (it's made out of your name), not your student id, which is just letters and numbers.
2. Make sure the notebook you submit is cleanly and fully executed. I do not grade unexecuted notebooks.
3. Submit your notebook back in Blackboard where you downloaded this file.

*Provide the output **exactly** as requested*

In [1]:
from copy import deepcopy
import random
from math import log2
from typing import List, Dict, Tuple, Callable

## Decision Trees

For this assignment you will be implementing and evaluating a Decision Tree using the ID3 Algorithm (**no** pruning or normalized information gain). Use the provided pseudocode. The data is located at (copy link):

http://archive.ics.uci.edu/ml/datasets/Mushroom

**Just in case** the UCI repository is down, which happens from time to time, I have included the data and name files on Blackboard.

<div style="background: lemonchiffon; margin:20px; padding: 20px;">
    <strong>Important</strong>
    <p>
        No Pandas. The only acceptable libraries in this class are those contained in the `environment.yml`. No OOP, either. You can use Dicts, NamedTuples, etc. as your abstract data type (ADT) for the tree and nodes.
    </p>
</div>

One of the things we did not talk about in the lectures was how to deal with missing values. There are two aspects of the problem here. What do we do with missing values in the training data? What do we do with missing values when doing classification?

There are a lot of different ways that we can handle this.
A common algorithm is to use something like kNN to impute the missing values.
We can use conditional probability as well.
There are also clever modifications to the Decision Tree algorithm itself that one can make.

We're going to do something simpler, given the size of the data set: remove the observations with missing values ("?").

You must implement the following functions:

`train` takes training_data and returns the Decision Tree as a data structure.

```
def train(training_data):
   # returns the Decision Tree.
```

`classify` takes a tree produced from the function above and applies it to labeled data (like the test set) or unlabeled data (like some new data).

```
def classify(tree, observations, labeled=True):
    # returns a list of classifications
```

`evaluate` takes a data set with labels (like the training set or test set) and the classification result and calculates the classification error rate:

$$error\_rate=\frac{errors}{n}$$

Do not use anything else as the evaluation metric or the submission will be deemed incomplete, i.e., an "F". (Hint: accuracy rate is not the error rate!).

`cross_validate` takes the data and uses 10 fold cross validation (from Module 3!) to `train`, `classify`, and `evaluate`. **Remember to shuffle your data before you create your folds**. I leave the exact signature of `cross_validate` to you but you should write it so that you can use it with *any* `classify` function of the same form (using higher order functions and partial application).

Following Module 3's assignment, `cross_validate` should print out a table in exactly the same format. What you are looking for here is a consistent evaluation metric across the folds. Print the error rate to 4 decimal places. **Do not convert to a percentage.**
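
One possible reading of the higher-order requirement is sketched below: the training and classification functions are passed in as arguments, and `functools.partial` binds any extra arguments beforehand. The names `toy_train`, `toy_classify`, `error_rate`, and `run_one_split` are hypothetical stand-ins for illustration only, not the required signatures, and the "model" here is just a majority label rather than a decision tree.

```python
from functools import partial
from typing import Callable, List

def toy_train(rows: List[List[str]]) -> str:
    # Hypothetical stand-in learner: the "model" is the most common label.
    labels = [row[0] for row in rows]
    return max(set(labels), key=labels.count)

def toy_classify(model: str, rows: List[List[str]], labeled: bool = True) -> List[str]:
    # Predicts the stored label for every row, whatever its attributes are.
    return [model for _ in rows]

def error_rate(rows: List[List[str]], predictions: List[str]) -> float:
    # errors / n, where the true label is the first element of each row.
    errors = sum(1 for row, pred in zip(rows, predictions) if row[0] != pred)
    return errors / len(rows)

def run_one_split(train_fn: Callable, classify_fn: Callable,
                  training: List[List[str]], test: List[List[str]]) -> float:
    # Works with any train/classify pair of the same form.
    model = train_fn(training)
    return error_rate(test, classify_fn(model, test))

# Partial application fixes the `labeled` flag before the function is passed along.
classify_labeled = partial(toy_classify, labeled=True)

training = [["e", "x"], ["e", "y"], ["p", "x"]]
test = [["e", "x"], ["p", "y"]]
rate = run_one_split(toy_train, classify_labeled, training, test)
print(f"{rate:.4f}")  # one of the two test rows is misclassified
```

Because `run_one_split` only calls the functions it is given, swapping in a different learner requires no change to the evaluation code.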

```
def pretty_print_tree(tree):
    # pretty prints the tree
```

This should be a text representation of a decision tree trained on the entire data set (no train/test).

To summarize...

Apply the Decision Tree algorithm to the Mushroom data set using 10 fold cross validation and the error rate as the evaluation metric. When you are done, apply the Decision Tree algorithm to the entire data set and print out the resulting tree.

**Note** Because this assignment has a natural recursive implementation, you should consider using `deepcopy` at the appropriate places.

-----

# Overview

The data is given in "agaricus-lepiota.data" as a text file - each line in the text file is a single observation, with a total of 8124 observations, each with 22 attributes.

| Index | Variable                  | Description |
| ----- | -----------               | ----------- |
| 0     | **class label**           | edible=e,poisonous=p |
| 1     | cap-shape                 | bell=b,conical=c,convex=x,flat=f,knobbed=k,sunken=s |
| 2     | cap-surface               | fibrous=f,grooves=g,scaly=y,smooth=s |
| 3     | cap-color                 | brown=n,buff=b,cinnamon=c,gray=g,green=r,pink=p,purple=u,red=e,white=w,yellow=y |
| 4     | bruises?                  | bruises=t,no=f |
| 5     | odor                      | almond=a,anise=l,creosote=c,fishy=y,foul=f,musty=m,none=n,pungent=p,spicy=s |
| 6     | gill-attachment           | attached=a,descending=d,free=f,notched=n |
| 7     | gill-spacing              | close=c,crowded=w,distant=d |
| 8     | gill-size                 | broad=b,narrow=n |
| 9     | gill-color                | black=k,brown=n,buff=b,chocolate=h,gray=g,green=r,orange=o,pink=p,purple=u,red=e,white=w,yellow=y |
| 10    | stalk-shape               | enlarging=e,tapering=t |
| 11    | stalk-root                | bulbous=b,club=c,cup=u,equal=e,rhizomorphs=z,rooted=r,missing=? |
| 12    | stalk-surface-above-ring  | fibrous=f,scaly=y,silky=k,smooth=s |
| 13    | stalk-surface-below-ring  | fibrous=f,scaly=y,silky=k,smooth=s |
| 14    | stalk-color-above-ring    | brown=n,buff=b,cinnamon=c,gray=g,orange=o,pink=p,red=e,white=w,yellow=y |
| 15    | stalk-color-below-ring    | brown=n,buff=b,cinnamon=c,gray=g,orange=o,pink=p,red=e,white=w,yellow=y |
| 16    | veil-type                 | partial=p,universal=u |
| 17    | veil-color                | brown=n,orange=o,white=w,yellow=y |
| 18    | ring-number               | none=n,one=o,two=t |
| 19    | ring-type                 | cobwebby=c,evanescent=e,flaring=f,large=l,none=n,pendant=p,sheathing=s,zone=z |
| 20    | spore-print-color         | black=k,brown=n,buff=b,chocolate=h,green=r,orange=o,purple=u,white=w,yellow=y |
| 21    | population                | abundant=a,clustered=c,numerous=n,scattered=s,several=v,solitary=y |
| 22    | habitat                   | grasses=g,leaves=l,meadows=m,paths=p,urban=u,waste=w,woods=d |

Attribute 11 (stalk-root) may have missing data, denoted by a "?"; any observation with missing data is excluded from the dataset. The class label is the first element of each observation.

The following three functions are taken from Programming Assignment 3 to parse data, create folds in data, and separate the data into training and test sets. The `parse_data` function is augmented to generate lists of characters rather than floats, and excludes rows with "?" characters.

The tree is implemented as a list of tuples of the form `(Attribute_name, Attribute_value, child)`. For a tree of the following shape, with attributes in order as X, Y, Z:

```
.
└── A/
    ├── B/
    │   ├── C
    │   └── D
    └── E/
        ├── F
        └── G
```

We would represent this as: 
```
[(X, A, 
    [(Y, B, 
        [(Z, C, ""), 
         (Z, D, "")]), 
     (Y, E, 
        [(Z, F, ""), 
         (Z, G, "")]
     )
    ]
   )
 ]
```

Here, leaf nodes have a single element for their `child` value (shown as `""` above; in a trained tree it is the class label), and internal nodes have a list of tuples.
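
To make the nesting concrete, the example tree above can be written as a Python literal and walked recursively. This is only an illustrative sketch: the helper `leaf_paths` is hypothetical (it is not one of the assignment's functions), and the attribute names are plain strings here, whereas the trained tree later in the notebook uses column indices.

```python
# The example tree, with attributes X, Y, Z and empty-string leaves.
tree = [("X", "A",
         [("Y", "B", [("Z", "C", ""), ("Z", "D", "")]),
          ("Y", "E", [("Z", "F", ""), ("Z", "G", "")])])]

def leaf_paths(node, path=()):
    # Collects the attribute values along every root-to-leaf path.
    paths = []
    for attr, value, child in node:
        step = path + (value,)
        if isinstance(child, list):
            # Internal node: the child is another list of tuples.
            paths.extend(leaf_paths(child, step))
        else:
            # Leaf node: the child is a single element.
            paths.append(step)
    return paths

print(leaf_paths(tree))
# [('A', 'B', 'C'), ('A', 'B', 'D'), ('A', 'E', 'F'), ('A', 'E', 'G')]
```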

<a id="parse_data"></a>
## parse_data

In [2]:
def parse_data(file_name: str) -> List[List]:
    data = []
    # the context manager closes the file even if parsing fails
    with open(file_name, "r") as file:
        for line in file:
            datum = line.rstrip().split(",")
            # drop observations with missing values
            if "?" not in datum: data.append(datum)
    random.shuffle(data)
    return data

In [3]:
data = parse_data("agaricus-lepiota.data")
len(data) # 8124 observations, 2480 observations with missing data

5644

In [4]:
print(data[1])

['e', 'f', 'y', 'g', 't', 'n', 'f', 'c', 'b', 'u', 't', 'b', 's', 's', 'g', 'w', 'p', 'w', 'o', 'p', 'n', 'y', 'd']


<a id="create_folds"></a>
## create_folds

In [5]:
def create_folds(xs: List, n: int) -> List[List[List]]:
    k, m = divmod(len(xs), n)
    # be careful of generators...
    return list(xs[i * k + min(i, m):(i + 1) * k + min(i + 1, m)] for i in range(n))

In [6]:
folds = create_folds(data, 10)

In [7]:
len(folds)

10

<a id="create_train_test"></a>
## create_train_test

In [8]:
def create_train_test(folds: List[List[List]], index: int) -> Tuple[List[List], List[List]]:
    training = []
    test = []
    for i, fold in enumerate(folds):
        if i == index:
            test = fold
        else:
            training = training + fold
    return training, test

In [9]:
train, test = create_train_test(folds, 0)

In [10]:
len(train)

5079

In [11]:
len(test)

565

# Functions

<a id="is_homogeneous"></a>
## is_homogeneous

`is_homogeneous` takes a dataset and returns whether it is homogeneous (whether all class labels are the same). This check is one of the base cases of the `id3` algorithm and determines whether branching needs to occur on the given dataset, or whether a class label can be assigned to the dataset in the decision tree. **Used by**: [id3](#id3).

* **data**: the dataset to determine homogeneity of

**returns** `bool`: whether the dataset is homogeneous.

In [12]:
def is_homogeneous(data) -> bool:
    curr_label = data[0][0] if data and data[0] else ""
    # an empty dataset is not homogeneous; bool() keeps the return type a real bool
    return bool(data) and all(row[0] == curr_label for row in data)

In [13]:
# assertions/unit tests
data = [['y', 's', 'l', 'g'],
        ['y', 's', 's', 'g'],
        ['y', 'r', 'l', 'g'],
        ['y', 's', 'l', 'g'],
        ['y', 's', 'l', 'g'],
        ['y', 'r', 's', 'g']]
assert is_homogeneous(data)

data = [['y', 's', 'l', 'g'],
        ['n', 's', 's', 'g'],
        ['y', 'r', 'l', 'g'],
        ['y', 's', 'l', 'g'],
        ['y', 's', 'l', 'g'],
        ['n', 'r', 's', 'g']]
assert not is_homogeneous(data)

assert not is_homogeneous([])

<a id="majority_label"></a>
## majority_label

`majority_label` takes a dataset and returns the majority class label. This is used as the class label for non-homogeneous base cases, where the dataset has no more attributes left to split on. **Used by**: [id3](#id3), [train](#train).

* **data**: the dataset to find the majority_label of

**returns** `str`: the majority class label of `data`.

In [14]:
def majority_label(data) -> str:
    label_counts = {}
    for row in data:
        # increment the count for this row's class label
        label_counts[row[0]] = label_counts.get(row[0], 0) + 1
    # ties go to the label seen first
    return max(label_counts, key=label_counts.get) if label_counts else ""

In [15]:
# assertions/unit tests
data = [['y', 's', 'l', 'g'],
        ['n', 's', 's', 'g'],
        ['y', 'r', 'l', 'g'],
        ['y', 's', 'l', 'g'],
        ['y', 's', 'l', 'g'],
        ['n', 'r', 's', 'g']]

label = majority_label(data)
assert label == "y"

data = [['y', 's', 'l', 'g'],
        ['n', 's', 's', 'g'],
        ['y', 'r', 'l', 'g'],
        ['n', 's', 'l', 'g'],
        ['y', 's', 'l', 'g'],
        ['n', 'r', 's', 'g']]
label = majority_label(data)
assert label == "y"

assert majority_label([]) == ""

<a id="set_entropy"></a>
## set_entropy

`set_entropy` determines the entropy of the given set. Entropy is defined by the formula: 
$$E = -\sum_i \frac{p_i}{n}\log_2(\frac{p_i}{n})$$

$p_i$ is the number of observations with class label `i`, $n$ is the total number of observations, and the sum runs over all class labels. Entropy measures how mixed the dataset is: for two classes, 0 means the dataset is homogeneous and 1 means it is perfectly split (50/50). Entropy is used to determine information gain in the `id3` algorithm. **Used by**: [pick_best_attr](#pick_best_attr).

* **dataset**: the dataset to compute the entropy of

**returns** `float`: the entropy of `dataset`.
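
As a worked check of the formula, consider a dataset of six observations with four `y` labels and two `n` labels, so $n = 6$, $p_y = 4$, and $p_n = 2$:

$$E = -\left(\frac{4}{6}\log_2\frac{4}{6} + \frac{2}{6}\log_2\frac{2}{6}\right) \approx 0.390 + 0.528 = 0.918$$

A dataset with only one label gives $E = -\frac{n}{n}\log_2\frac{n}{n} = 0$, and an even two-way split gives $E = -2 \cdot \frac{1}{2}\log_2\frac{1}{2} = 1$.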

In [16]:
def set_entropy(dataset) -> float:
    counts = {}
    for row in dataset:
        counts[row[0]] = counts.get(row[0], 0) + 1
    total_count = sum(counts.values())
    entropy = 0
    for val in counts.keys():
        entropy += (counts[val] / total_count)*log2((counts[val] / total_count))
    return -entropy

In [17]:
data = [['y', 's', 'l', 'g'],
        ['n', 's', 's', 'g'],
        ['y', 'r', 'l', 'g'],
        ['y', 's', 'l', 'g'],
        ['y', 's', 'l', 'g'],
        ['n', 'r', 's', 'g']]
e_s = set_entropy(data)
assert abs(e_s - 0.918) < 0.005

data = [['n', 'r', 'l', 'b'],
        ['n', 's', 's', 'b'],
        ['n', 'r', 's', 'b']]
e_s = set_entropy(data)
assert e_s == 0

data = [['n', 's', 's', 'r'],
        ['y', 'r', 'l', 'r'],
        ['y', 'r', 's', 'r'],
        ['n', 's', 'l', 'r'],
        ['y', 'r', 'l', 'r'],
        ['n', 's', 's', 'r']]
e_s = set_entropy(data)
assert e_s == 1.0

<a id="find_subset"></a>
## find_subset

`find_subset` takes a set of `data` and an attribute/value pair and returns a subset of `data` where the value of `best_attr` for each observation is `value`. In other words, it takes only data that has the given value for the given attribute from `data`. These subsets are used for both computing entropies as well as narrowing down the dataset in the recursive portion of the `id3` algorithm. 

To score a split on one attribute, we take the subset matching each value of that attribute, compute the entropy of each subset, and combine the entropies in a sum weighted by subset size. For example, for the attribute `Shape` we would find the subsets matching the values `square` and `round`, compute the entropy of each, and add them, weighted by subset size, to get the entropy after splitting on `Shape`. **Used by**: [pick_best_attr](#pick_best_attr), [id3](#id3).

* **data**: the data to find a subset from
* **best_attr**: the attribute as a column index in `data`
* **value**: the chosen value for `best_attr` to match

**returns** `List`: a subset of `data`.

In [18]:
def find_subset(data, best_attr, value) -> List:
    subset = []
    for row in data:
        if row[best_attr] == value:
            subset.append(deepcopy(row))
    return subset

In [19]:
# assertions/unit tests
data = [["a", "b", "c", "d", "e"],
        ["b", "e", "b", "c", "d"],
        ["a", "f", "g", "h", "i"],
        ["a", "e", "x", "e", "f"]]
subset = find_subset(data, 1, "b")
assert subset == [data[0]]

data = [["a", "b", "c", "d", "e"],
        ["b", "e", "b", "c", "d"],
        ["a", "f", "g", "h", "i"],
        ["a", "e", "x", "e", "f"],
        ["x", "e", "f", "g", "n"]]
subset = find_subset(data, 1, "e")
assert subset == [["b", "e", "b", "c", "d"], ["a", "e", "x", "e", "f"], ["x", "e", "f", "g", "n"]]

data = [["a", "b", "c", "d", "e"],
        ["b", "e", "b", "c", "d"],
        ["a", "f", "g", "h", "i"],
        ["a", "e", "x", "e", "f"]]
subset = find_subset(data, 0, "c")
assert not subset

<a id="pick_best_attr"></a>
## pick_best_attr

`pick_best_attr` is the method by which `id3` chooses which attribute to recurse on next. By computing the information gain, which is the difference in start entropy and entropy after splitting on any given attribute, we can determine which attribute gives us the most information (and which we should split on next). The function iterates over every attribute in `attributes` and computes this information gain by taking subsets of `data` for each value in the domain of each attribute. Finally, computing the weighted sum of each attribute value's entropy, we can compute information gain by taking the difference between the attribute entropy and the starting entropy. The function returns the attribute with the highest information gain. **Used by**: [id3](#id3).

* **data**: the starting dataset
* **attributes**: the remaining attributes to split on
* **domains**: a list of all values for each attribute

**returns** `int`: the attribute index which yields the highest information gain.
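
The information gain calculation described above can be sketched in a self-contained form. This is an illustration under stated assumptions, not the notebook's implementation: the helpers `entropy` and `information_gain` are hypothetical names, and the values of each attribute are read from the data itself rather than from a `domains` list.

```python
from collections import Counter
from math import log2

def entropy(labels):
    # -sum over labels of (count / n) * log2(count / n)
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(rows, attr):
    # Gain = entropy before the split minus the size-weighted entropy after it.
    before = entropy([row[0] for row in rows])
    after = 0.0
    for value in set(row[attr] for row in rows):
        subset = [row[0] for row in rows if row[attr] == value]
        after += (len(subset) / len(rows)) * entropy(subset)
    return before - after

rows = [['y', 's', 'l', 'g'],
        ['n', 's', 's', 'g'],
        ['y', 'r', 'l', 'g'],
        ['y', 's', 'l', 'g'],
        ['y', 's', 'l', 'g'],
        ['n', 'r', 's', 'g']]
gains = {attr: information_gain(rows, attr) for attr in (1, 2, 3)}
best = max(gains, key=gains.get)
print(best)  # 2: splitting on column 2 separates the labels completely
```

Column 3 has the same value in every row, so splitting on it leaves the entropy unchanged and its gain is zero.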

In [20]:
def pick_best_attr(data, attributes, domains) -> int:
    e_start = set_entropy(data)
    info_gain = {}
    for attr in attributes:
        # weighted sum of the entropies of the subsets produced by splitting on attr
        curr_entropy = 0
        for attr_val in domains[attr]:
            subset = find_subset(data, attr, attr_val)
            curr_entropy += set_entropy(subset)*(len(subset)/(len(data)))
        info_gain[attr] = e_start - curr_entropy
    return max(info_gain, key=info_gain.get)

In [21]:
# assertions/unit tests
data = [['y', 's', 'l', 'g'],
        ['n', 's', 's', 'g'],
        ['y', 'r', 'l', 'g'],
        ['y', 's', 'l', 'g'],
        ['y', 's', 'l', 'g'],
        ['n', 'r', 's', 'g']]
attributes = [1, 2]
domains = [['y', 'n'], ['s', 'r'], ['l', 's'], ['g', 'r', 'b']]
best_attr = pick_best_attr(data, attributes, domains)
assert best_attr == 2

data = [['n', 's', 's', 'r'],
        ['y', 'r', 'l', 'r'],
        ['y', 'r', 's', 'r'],
        ['n', 's', 'l', 'r'],
        ['y', 'r', 'l', 'r'],
        ['n', 's', 's', 'r']]
best_attr = pick_best_attr(data, attributes, domains)
assert best_attr == 1

data = [['n', 'r', 'l', 'b'], 
        ['y', 's', 'l', 'g'],
        ['n', 's', 's', 'r'],
        ['y', 'r', 'l', 'r'],
        ['n', 's', 's', 'b'],
        ['n', 'r', 's', 'b'],
        ['y', 'r', 's', 'r'],
        ['n', 's', 's', 'g'],
        ['y', 'r', 'l', 'g'],
        ['y', 's', 'l', 'g'],
        ['n', 's', 'l', 'r'],
        ['y', 's', 'l', 'g'],
        ['y', 'r', 'l', 'r'],
        ['n', 's', 's', 'r'],
        ['n', 'r', 's', 'g']]
attributes = [1, 2, 3]
best_attr = pick_best_attr(data, attributes, domains)
assert best_attr == 2

<a id="attribute_minus"></a>
## attribute_minus

`attribute_minus` returns `attributes` without `best_attr` - this is a helper function to remove attributes that have already been split on in `id3`. **Used by**: [id3](#id3).

* **attributes**: the list of current attributes
* **best_attr**: the attribute to remove

**returns** `List`: a copy of `attributes` with `best_attr` removed.

In [22]:
def attribute_minus(attributes, best_attr) -> List:
    new_attributes = deepcopy(attributes)
    if best_attr in attributes: new_attributes.remove(best_attr)
    return new_attributes

In [23]:
# assertions/unit tests
attr = [1, 2, 3, 4]
new_attr = attribute_minus(attr, 2)
assert new_attr == [1, 3, 4]
assert attr == [1, 2, 3, 4]

assert attribute_minus([], 1) == []

<a id="add_child"></a>
## add_child

`add_child` adds a new node of the form `(best_attr, value, child)` to `node` and returns `node`. This is a helper function to append nodes to the decision tree produced by `id3`. **Used by**: [id3](#id3).

* **node**: the current tree
* **best_attr**: the new node's attribute
* **value**: the new node's attribute value
* **child**: the new node's child/children

**returns** `List`: `node` with the new node attached.

In [24]:
def add_child(node, best_attr, value, child) -> List:
    new_node = deepcopy(node)
    new_node.append((deepcopy(best_attr), deepcopy(value), deepcopy(child)))
    return new_node

In [25]:
# assertions/unit tests
node = []
new_node = add_child(node, 0, "e", [])
assert new_node == [(0, "e", [])]
assert node == []

new_node = add_child(node, 1, "f", [(2, "f", "e"), (2, "x", "p")])
assert new_node == [(1, 'f', [(2, 'f', 'e'), (2, 'x', 'p')])]

<a id="id3"></a>
## id3

`id3` is a decision-tree generating algorithm that takes a dataset and returns a decision tree for classification trained on `data`. The recursive algorithm takes a set of training data, a list of current attributes, the domains for all attributes, and a default majority class label. 

`id3` works by splitting on attributes and creating children for each value in the domain of that attribute. The attributes chosen decrease the entropy of the dataset and bring it closer to homogeneity. In the base case, the dataset passed in is homogeneous, and the class label is the child of the previous node. All leaves of the decision tree have a class label as their child, while interior nodes have a list of nodes as their children. **Uses**: [is_homogeneous](#is_homogeneous), [majority_label](#majority_label), [pick_best_attr](#pick_best_attr), [find_subset](#find_subset), [attribute_minus](#attribute_minus), [add_child](#add_child). **Used by**: [train](#train).

* **data**: the training data to build the decision tree from
* **attributes**: a list of column indices representing attributes
* **domains**: a list of values for every attribute
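* **default**: the class label to return when `data` is empty (the majority class label of the parent's data).

**returns** `List`: a decision tree (or a class label `str` when a base case is reached)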

In [26]:
def id3(data, attributes, domains, default):
    if not data: return default
    if is_homogeneous(data): 
        return deepcopy(data[0][0])
    if not attributes: return majority_label(data)
    best_attr = pick_best_attr(data, attributes, domains) 
    node = []
    default_label = majority_label(data)
    for value in domains[best_attr]: 
        subset = find_subset(data, best_attr, value)
        new_attributes = attribute_minus(attributes, best_attr)
        child = id3(subset, new_attributes, domains, default_label)
        node = add_child(node, best_attr, value, child)
    return node

In [27]:
# assertions/unit tests
data = [['y', 's', 'l', 'g'],
        ['n', 's', 's', 'g'],
        ['y', 'r', 'l', 'g'],
        ['y', 's', 'l', 'g'],
        ['y', 's', 'l', 'g'],
        ['n', 'r', 's', 'g']]

attributes = [1, 2, 3]
domains = [['n', 'y'], ['r', 's'], ['l', 's'], ['b', 'g', 'r']]
tree = id3(data, attributes, domains, 'n')
assert tree == [(2, 'l', 'y'), (2, 's', 'n')]

data = [['n', 's', 's', 'r'],
        ['y', 'r', 'l', 'r'],
        ['y', 'r', 's', 'r'],
        ['n', 's', 'l', 'r'],
        ['y', 'r', 'l', 'r'],
        ['n', 's', 's', 'r']]
tree = id3(data, attributes, domains, 'n')
assert tree == [(1, 'r', 'y'), (1, 's', 'n')]

data = [['n', 'r', 'l', 'b'], 
        ['y', 's', 'l', 'g'],
        ['y', 'r', 'l', 'r'],
        ['y', 'r', 'l', 'g'],
        ['y', 's', 'l', 'g'],
        ['n', 's', 'l', 'r'],
        ['y', 's', 'l', 'g'],
        ['y', 'r', 'l', 'r']]
tree = id3(data, attributes, domains, 'n')
assert tree == [(3, 'b', 'n'), (3, 'g', 'y'), (3, 'r', [(1, 'r', 'y'), (1, 's', 'n')])]

# Model Performance and Analysis

<a id="train"></a>
## train

`train` trains a decision tree from `data` and `domains`. The tree is returned. **Uses**: [majority_label](#majority_label), [id3](#id3). **Used by**: [cross_validate](#cross_validate).

* **data**: the training data
* **domains**: the domains of `data`

**returns** `List`: a decision tree

In [28]:
def train(data: List, domains: List) -> List:
    attributes = [i for i in range(1, len(data[0]))]
    default = majority_label(data)
    return id3(deepcopy(data), attributes, domains, default)

In [29]:
# assertions/unit tests
data = [['y', 's', 'l', 'g'],
        ['n', 's', 's', 'g'],
        ['y', 'r', 'l', 'g'],
        ['y', 's', 'l', 'g'],
        ['y', 's', 'l', 'g'],
        ['n', 'r', 's', 'g']]
domains = [['n', 'y'], ['r', 's'], ['l', 's'], ['b', 'g', 'r']]
assert train(data, domains) == [(2, 'l', 'y'), (2, 's', 'n')]

data = [['n', 's', 's', 'r'],
        ['y', 'r', 'l', 'r'],
        ['y', 'r', 's', 'r'],
        ['n', 's', 'l', 'r'],
        ['y', 'r', 'l', 'r'],
        ['n', 's', 's', 'r']]
assert train(data, domains) == [(1, 'r', 'y'), (1, 's', 'n')]

data = [['n', 'r', 'l', 'b'], 
        ['y', 's', 'l', 'g'],
        ['y', 'r', 'l', 'r'],
        ['y', 'r', 'l', 'g'],
        ['y', 's', 'l', 'g'],
        ['n', 's', 'l', 'r'],
        ['y', 's', 'l', 'g'],
        ['y', 'r', 'l', 'r']]
assert train(data, domains) == [(3, 'b', 'n'), (3, 'g', 'y'), (3, 'r', [(1, 'r', 'y'), (1, 's', 'n')])]

<a id="classify"></a>
## classify

`classify` takes a decision tree and a list of observations, and an optional flag for whether the observations are labeled, and returns a list of classifications for every observation. If `labeled` is `False`, a column of `None` values is prepended to the data before classification.

For each observation, the function walks down the tree, following the branch whose attribute value matches the observation, until it reaches a leaf; the leaf's class label is appended to the list of classifications. **Used by**: [cross_validate](#cross_validate).

* **decision_tree**: the decision tree used for classification
* **observations**: the observations to classify
* **labeled**: an optional parameter that specifies whether `observations` is labeled

**returns** `List`: a list of classifications
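
The tree walk for a single observation can also be written recursively. The sketch below is an alternative illustration, not the notebook's iterative implementation: `classify_one` is a hypothetical helper, and returning `None` when no branch matches is an assumption made here for the example.

```python
def classify_one(node, row):
    # A leaf is a class label (a str); an internal node is a list of branches.
    if isinstance(node, str):
        return node
    for attr, value, child in node:
        if row[attr] == value:
            return classify_one(child, row)
    # Assumption for this sketch: an unseen attribute value yields no label.
    return None

# A small tree in the (attribute_index, attribute_value, child) form.
tree = [(3, 'b', 'n'),
        (3, 'g', 'y'),
        (3, 'r', [(1, 'r', 'y'), (1, 's', 'n')])]

# Index 0 is the (unknown) class label, so it is filled with None.
print(classify_one(tree, [None, 's', 'l', 'r']))  # n
print(classify_one(tree, [None, 'r', 'l', 'g']))  # y
```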

In [30]:
def classify(decision_tree: List, observations: List, labeled=True) -> List:
    dataset = deepcopy(observations) if labeled else [[None] + deepcopy(row) for row in observations]
    classifications = []
    for row in dataset:
        curr_node = decision_tree
        # descend until the current node is a class label rather than a list of branches
        while not isinstance(curr_node, str):
            for attr, attr_val, child in curr_node:
                if row[attr] == attr_val:
                    curr_node = child
                    break
            else:
                # no branch matches this value: stop instead of looping forever
                curr_node = None
                break
        classifications.append(curr_node)
    return classifications

In [31]:
# assertions/unit tests
data = [['n', 'r', 'l', 'b'], 
        ['y', 's', 'l', 'g'],
        ['n', 's', 's', 'r'],
        ['y', 'r', 'l', 'r'],
        ['n', 's', 's', 'b'],
        ['n', 'r', 's', 'b'],
        ['y', 'r', 's', 'r'],
        ['n', 's', 's', 'g'],
        ['y', 'r', 'l', 'g'],
        ['y', 's', 'l', 'g'],
        ['n', 's', 'l', 'r'],
        ['y', 's', 'l', 'g'],
        ['y', 'r', 'l', 'r'],
        ['n', 's', 's', 'r'],
        ['n', 'r', 's', 'g']]
attributes = [1, 2, 3]
domains = [['n', 'y'], ['r', 's'], ['l', 's'], ['b', 'g', 'r']]
tree = id3(data, attributes, domains, 'n')

observation = [['n', 's', 's', 'g']]
classifications = classify(tree, observation)
assert classifications == ['n']

observations = [['s', 's', 'r'], ['r', 's', 'b']]
classifications = classify(tree, observations, False)
assert classifications == ['n', 'n']

observations = [deepcopy(row[1:]) for row in data]
classifications = classify(tree, observations, False)
assert classifications == [row[0] for row in data]

<a id="evaluate"></a>
## evaluate

`evaluate` takes a list of classifications and labeled data and returns the error rate of the classifications. Error rate is: 
$$error\_rate=\frac{errors}{n}$$
**Used by**: [cross_validate](#cross_validate).

* **labeled_data**: the real values for labels to compare to
* **classifications**: the estimates to determine error rate for

**returns** `float`: the error rate as a float.

In [32]:
def evaluate(labeled_data, classifications) -> float:
    num_errors = 0
    labels = [row[0] for row in labeled_data]
    for actual_label, classified_label in zip(labels, classifications):
        if actual_label != classified_label: num_errors += 1
    return num_errors / len(classifications)

In [33]:
# assertions/unit tests
data = [['n', 'r', 'l', 'b'], 
        ['y', 's', 'l', 'g'],
        ['n', 's', 's', 'r'],
        ['y', 'r', 'l', 'r'],
        ['n', 's', 's', 'b'],
        ['n', 'r', 's', 'b'],
        ['y', 'r', 's', 'r'],
        ['n', 's', 's', 'g'],
        ['y', 'r', 'l', 'g'],
        ['y', 's', 'l', 'g'],
        ['n', 's', 'l', 'r'],
        ['y', 's', 'l', 'g'],
        ['y', 'r', 'l', 'r'],
        ['n', 's', 's', 'r'],
        ['n', 'r', 's', 'g']]
attributes = [1, 2, 3]
domains = [['n', 'y'], ['r', 's'], ['l', 's'], ['b', 'g', 'r']]
tree = id3(data, attributes, domains, 'n')
classifications = classify(tree, data)
assert evaluate(data, classifications) == 0

<a id="cross_validate"></a>
## cross_validate

`cross_validate` takes the data and uses 10 fold cross validation to `train`, `classify`, and `evaluate`. The function shuffles the data, splits the data into folds, and performs 10-fold cross validation on the folds. The error rate for each fold's evaluation is printed, and the average error rate is printed at the end. **Uses**: [create_folds](#create_folds), [create_train_test](#create_train_test), [train](#train), [classify](#classify), [evaluate](#evaluate).
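
The directions ask for error rates printed to 4 decimal places. A format specifier such as `.4f` is one way to do that; the sketch below uses made-up error rates purely to show the formatting, and the line layout is an assumption rather than the exact Module 3 table format.

```python
# Made-up per-fold error rates, used only to demonstrate the formatting.
error_rates = [0.0, 0.0125, 0.1]

lines = [f"Fold {i} error rate: {rate:.4f}" for i, rate in enumerate(error_rates)]
average = sum(error_rates) / len(error_rates)
lines.append(f"Avg. error rate: {average:.4f}")

for line in lines:
    print(line)
```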

In [34]:
def cross_validate(data, domains, classify):
    avg_err_rate = 0
    random.shuffle(data)
    folds = create_folds(data, 10)
    for i in range(10):
        train_data, test_data = create_train_test(folds, i)
        decision_tree = train(train_data, domains)
        classifications = classify(decision_tree, test_data)
        error_rate = evaluate(test_data, classifications)
        print("Fold", i, "error rate:", error_rate)
        avg_err_rate += error_rate
    avg_err_rate = avg_err_rate / 10
    print("Avg. error rate:", avg_err_rate)

In [35]:
domains = [['e', 'p'], 
               ['b', 'c', 'x', 'f', 'k', 's'], 
               ['f', 'g', 'y', 's'], 
               ['n', 'b', 'c', 'g', 'r', 'p', 'u', 'e', 'w', 'y'], 
               ['t', 'f'], 
               ['a', 'l', 'c', 'y', 'f', 'm', 'n', 'p', 's'], 
               ['a', 'd', 'f', 'n'], 
               ['c', 'w', 'd'], 
               ['b', 'n'], 
               ['k', 'n', 'b', 'h', 'g', 'r', 'o', 'p', 'u', 'e', 'w', 'y'], 
               ['e', 't'], 
               ['b', 'c', 'u', 'e', 'z', 'r'], 
               ['f', 'y', 'k', 's'], 
               ['f', 'y', 'k', 's'], 
               ['n', 'b', 'c', 'g', 'o', 'p', 'e', 'w', 'y'], 
               ['n', 'b', 'c', 'g', 'o', 'p', 'e', 'w', 'y'], 
               ['p', 'u'], 
               ['n', 'o', 'w', 'y'], 
               ['n', 'o', 't'], 
               ['c', 'e', 'f', 'l', 'n', 'p', 's', 'z'], 
               ['k', 'n', 'b', 'h', 'r', 'o', 'u', 'w', 'y'], 
               ['a', 'c', 'n', 's', 'v', 'y'], 
               ['g', 'l', 'm', 'p', 'u', 'w', 'd']]
data = parse_data("agaricus-lepiota.data")
cross_validate(data, domains, classify)

Fold 0 error rate: 0.0
Fold 1 error rate: 0.0
Fold 2 error rate: 0.0
Fold 3 error rate: 0.0
Fold 4 error rate: 0.0
Fold 5 error rate: 0.0
Fold 6 error rate: 0.0
Fold 7 error rate: 0.0
Fold 8 error rate: 0.0
Fold 9 error rate: 0.0
Avg. error rate: 0.0


<a id="pretty_print_tree"></a>
## pretty_print_tree

`pretty_print_tree` prints the given tree in a nested style, with each child denoted by `---->` from the parent. The nodes are printed in order, so indentation specifies which nodes are children of which parents.
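
Because the tree stores attributes as column indices, a reader has to look each index up in the table in the Overview. A variant that substitutes attribute names could look like the sketch below. The helper `render_tree`, the `names` mapping, and the small hand-written tree are hypothetical and for illustration only; the names are taken from the attribute table.

```python
def render_tree(tree, names, depth=0):
    # Returns one text line per node, indented four spaces per level.
    lines = []
    for attr, value, child in tree:
        label = f"{'    ' * depth}{names.get(attr, attr)} = {value}"
        if isinstance(child, str):
            # Leaf: show the class label after an arrow.
            lines.append(f"{label} -> {child}")
        else:
            lines.append(label)
            lines.extend(render_tree(child, names, depth + 1))
    return lines

names = {3: "cap-color", 5: "odor", 20: "spore-print-color"}
small_tree = [(5, 'a', 'e'),
              (5, 'n', [(20, 'r', 'p'), (20, 'k', 'e')])]
for line in render_tree(small_tree, names):
    print(line)
# odor = a -> e
# odor = n
#     spore-print-color = r -> p
#     spore-print-color = k -> e
```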

In [36]:
def pretty_print_tree(tree, tabs=0):
    # prefix each node with four dashes per level of depth, followed by '>'
    tabs_str = "-" * (tabs * 4) + ">"
    for node in tree:
        if isinstance(node[2], str): print(tabs_str, node)
        else:
            print(tabs_str, node[0:2], "...")
            pretty_print_tree(node[2], tabs + 1)

In [37]:
tree = train(data, domains)
pretty_print_tree(tree, 0)

> (5, 'a', 'e')
> (5, 'l', 'e')
> (5, 'c', 'p')
> (5, 'y', 'p')
> (5, 'f', 'p')
> (5, 'm', 'p')
> (5, 'n') ...
----> (20, 'k', 'e')
----> (20, 'n', 'e')
----> (20, 'b', 'e')
----> (20, 'h', 'e')
----> (20, 'r', 'p')
----> (20, 'o', 'e')
----> (20, 'u', 'e')
----> (20, 'w') ...
--------> (3, 'n', 'e')
--------> (3, 'b', 'e')
--------> (3, 'c', 'e')
--------> (3, 'g', 'e')
--------> (3, 'r', 'e')
--------> (3, 'p', 'e')
--------> (3, 'u', 'e')
--------> (3, 'e', 'e')
--------> (3, 'w', 'p')
--------> (3, 'y', 'p')
----> (20, 'y', 'e')
> (5, 'p', 'p')
> (5, 's', 'p')


## Before You Submit...

1. Did you provide output exactly as requested?
2. Did you re-execute the entire notebook? ("Restart Kernel and Run All Cells...")
3. If you did not complete the assignment or had difficulty please explain what gave you the most difficulty in the Markdown cell below.
4. Did you change the name of the file to `jhed_id.ipynb`?

Do not submit any other files.