# Artificial Intelligence
# 464/664
# Assignment #6

## General Directions for this Assignment

00. We're using a Jupyter Notebook environment (tutorial available here: https://jupyter-notebook-beginner-guide.readthedocs.io/en/latest/what_is_jupyter.html),
01. Read the entire notebook before beginning your work, 
02. Output format should be exactly as requested (it is your responsibility to make sure the notebook looks as expected on Gradescope),
03. Each helper function should be preceded by documentation (Markdown cell), 
04. Each helper function (not `train`) should be followed by three assert-style unit tests,
05. **Do not use any AI/ML libraries or packages, such as pandas or scikit-learn (numpy is fine)**,
06. Functions should do only one thing,
07. Check submission deadline on Gradescope, 
08. Rename the file to Last_First_assignment_6, 
09. Submit your notebook (as .ipynb, not PDF) using Gradescope, and
10. Do not submit any other files.

## Before You Submit...

1. Re-read the general instructions provided above, and
2. Hit "Kernel"->"Restart & Run All".

## Decision Trees

For this assignment we will implement a Decision Tree using the ID3 Algorithm. The goal is to classify a mushroom as either edible ('e') or poisonous ('p'). The dataset has been uploaded to Canvas. In case you'd like to learn more about it, here's the link to the repo: https://archive.ics.uci.edu/dataset/73/mushroom. 


Our  Decision Tree pipeline is as follows:


1) `cross_validate` will take data (supplied as folds using 10 fold cross validation) and do the following:
* For each setting of depth limit (the hyperparameter in decision trees, including 0)
* * and for each fold of data
* * * use `create_train_test` to split current fold into train and test
* * * call `train` to build and return a decision tree, 
* * * call `classify` to use the tree to get classifications,
* * * call `evaluate` to compare classifications to the actual answers (ground truth),
* * * Print the performance for that fold
* * Summarize the performance for that depth limit over all folds using `get_stats`


2) `pretty_print_tree(tree)` will print what the tree looks like when using the **entire** data set (no train/test split) with depth limit set to None.


All the code in this pipeline has been provided, except for a working `train` function. The `train` function currently returns a hard-coded tree from our lecture. Don't do that. Use ID3 to build your tree and use the depth limit to stop. When your `train` function is complete, it should work for both the lecture data and the mushroom data. Although `train` is terrible right now, pay attention to how the tree is structured.

In [1]:
import random
import math
import copy
from copy import deepcopy
from typing import List, Dict, Tuple, Callable

<a id="note"></a>

<div style="background: lemonchiffon; margin:20px; padding: 20px;">
    <strong>Note</strong>
    <p>
        Let's start with our example from the 06-Nov lecture. Target variable is Safe?, which can be yes or no. Anything *_lecture refers to the dataset we walked through in class.  
    </p>
</div>

In [2]:
data_lecture = [['round','large','blue','no'],
['square','large','green','yes'],
['square','small','red','no'],
['round','large','red','yes'],
['square','small','blue','no'],
['round','small','blue','no'],
['round','small','red','yes'],
['square','small','green','no'],
['round','large','green','yes'],
['square','large','green','yes'],
['square','large','red','no'],
['square','large','green','yes'],
['round','large','red','yes'],
['square','small','red','no'],
['round','small','green','no']]

In [3]:
print(data_lecture[0]) # a record of data

['round', 'large', 'blue', 'no']


In [4]:
len(data_lecture)

15

In [5]:
attribute_names_lecture = ['shape', 
                      'size', 
                      'color']

<a id="create_folds"></a>
## create_folds


With n-fold cross validation, we divide our data set into n subgroups called "folds" and then use those folds for training and testing. For a data set with 100 observations (or records), setting n to 10 would put 10 observations in each fold.

* **data** List: a list (data_lecture, for instance)
* **n** int: number of folds


**returns** 
folds, which is a list of n items, where each item is a list containing a subgroup of the records in `data`

In [6]:
def create_folds(data: List, n: int) -> List[List[List]]:
    k, m = divmod(len(data), n)
    return list(data[i * k + min(i, m):(i + 1) * k + min(i + 1, m)] for i in range(n))

In [7]:
folds_lecture = create_folds(data=data_lecture, n=10)

In [8]:
len(folds_lecture)

10

In [9]:
print(folds_lecture[0])

[['round', 'large', 'blue', 'no'], ['square', 'large', 'green', 'yes']]


In [10]:
print(folds_lecture[1])

[['square', 'small', 'red', 'no'], ['round', 'large', 'red', 'yes']]
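
The fold sizes seen in the two outputs above follow from the `divmod` arithmetic in `create_folds`: the first `m` folds each receive one extra record. A minimal sketch of that arithmetic, where `fold_sizes` is a hypothetical helper that is not part of the assignment code:

```python
from typing import List


def fold_sizes(num_records: int, n: int) -> List[int]:
    """Sizes of the n folds produced by divmod-based slicing (hypothetical helper)."""
    k, m = divmod(num_records, n)  # k records per fold, m records left over
    # The first m folds absorb one leftover record each.
    return [k + 1 if i < m else k for i in range(n)]


print(fold_sizes(15, 10))   # lecture data: 15 records, 10 folds
print(fold_sizes(100, 10))  # evenly divisible case
```

For the 15 lecture records this gives five folds of two records followed by five folds of one, which is why `folds_lecture[0]` and `folds_lecture[1]` each hold two records.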


<a id="create_train_test"></a>
## create_train_test


This function takes the n folds and returns the train and test sets. One of the n folds is used to test, the others are used for training.

* **folds** List[List[List]]: see `create_folds`
* **index** int: fold index that is used for testing


**returns** 
A tuple `(training, test)`: `training` is the concatenation of every fold except the one at `index`, and `test` is the fold at `index`.

In [11]:
def create_train_test(folds: List[List[List]], index: int) -> Tuple[List[List], List[List]]:
    training = []
    test = []
    for i, fold in enumerate(folds):
        if i == index:
            test = fold
        else:
            training = training + fold
    return training, test

In [12]:
train_lecture, test_lecture = create_train_test(folds_lecture, 0) # test data is folds_lecture index 0

In [13]:
print(train_lecture)

[['square', 'small', 'red', 'no'], ['round', 'large', 'red', 'yes'], ['square', 'small', 'blue', 'no'], ['round', 'small', 'blue', 'no'], ['round', 'small', 'red', 'yes'], ['square', 'small', 'green', 'no'], ['round', 'large', 'green', 'yes'], ['square', 'large', 'green', 'yes'], ['square', 'large', 'red', 'no'], ['square', 'large', 'green', 'yes'], ['round', 'large', 'red', 'yes'], ['square', 'small', 'red', 'no'], ['round', 'small', 'green', 'no']]


In [14]:
print(test_lecture)

[['round', 'large', 'blue', 'no'], ['square', 'large', 'green', 'yes']]


In [15]:
train_lecture, test_lecture = create_train_test(folds_lecture, 1) # test data is folds_lecture index 1

In [16]:
print(train_lecture)

[['round', 'large', 'blue', 'no'], ['square', 'large', 'green', 'yes'], ['square', 'small', 'blue', 'no'], ['round', 'small', 'blue', 'no'], ['round', 'small', 'red', 'yes'], ['square', 'small', 'green', 'no'], ['round', 'large', 'green', 'yes'], ['square', 'large', 'green', 'yes'], ['square', 'large', 'red', 'no'], ['square', 'large', 'green', 'yes'], ['round', 'large', 'red', 'yes'], ['square', 'small', 'red', 'no'], ['round', 'small', 'green', 'no']]


In [17]:
print(test_lecture)

[['square', 'small', 'red', 'no'], ['round', 'large', 'red', 'yes']]


<a id="note"></a>

<div style="background: lemonchiffon; margin:20px; padding: 20px;">
    <p>
        Let's load the mushroom data.
    </p>
</div>

<a id="parse_data"></a>
## parse_data

Opens a file, splits each line on commas, and shuffles the records before returning them as a list of lists. 

* **file_name** str: filename for data


**returns** 
Data as a list of lists.

In [18]:
def parse_data(file_name: str) -> List[List]:
    data = []
    # The context manager closes the file even if parsing fails.
    with open(file_name, "r") as file:
        for line in file:
            data.append(line.rstrip().split(","))
    random.shuffle(data)
    return data

In [19]:
data_mushroom = parse_data("agaricus-lepiota.data")

<div style="background: lemonchiffon; margin:20px; padding: 20px;">
    <strong>Important</strong>
    <p>
        We're going to move the target column (mushroom edible or poisonous) to the last column to match the lecture's format, where Safe? was at the end.
    </p>
</div>

In [20]:
data_mushroom = [record[1:]+[record[0]] for record in data_mushroom]

In [21]:
len(data_mushroom)

8124

In [22]:
print(data_mushroom[0])

['x', 'f', 'y', 'f', 'f', 'f', 'c', 'b', 'h', 'e', 'b', 'k', 'k', 'b', 'n', 'p', 'w', 'o', 'l', 'h', 'v', 'd', 'p']


In [23]:
attribute_names_mushroom = ['cap-shape',
                   'cap-surface',
                   'cap-color',
                   'bruises?',
                   'odor',
                   'gill-attachment',
                   'gill-spacing',
                   'gill-size',
                   'gill-color',
                   'stalk-shape',
                   'stalk-root',
                   'stalk-surface-above-ring',
                   'stalk-surface-below-ring',
                   'stalk-color-above-ring',
                   'stalk-color-below-ring',
                   'veil-type',
                   'veil-color',
                   'ring-number',
                   'ring-type',
                   'spore-print-color',
                   'population',
                   'habitat']

<a id="get_answers"></a>
## get_answers

This function extracts a list of the target values from data. The function assumes the target variable is the last column of the data.

* **data** List[List]: The data provided in a list of list format identical to the structure of `data_lecture` or `data_mushroom`


**returns** 
A list of the values of the target variable.

In [24]:
def get_answers(data):
    return [record[-1] for record in data]

In [25]:
assert get_answers([]) == []
assert get_answers(data_lecture) == ['no', 'yes', 'no', 'yes', 'no', 'no', 'yes', 'no', 'yes', 'yes', 'no', 'yes', 'yes', 'no', 'no']

<a id="get_mode"></a>
## get_mode

This function finds the mode of a list of items. 

* **answers** List: A list of items

**returns** 
The item that appears the most often in the list. 

In [26]:
def get_mode(answers):
    count_dict = {}
    for answer in answers:
        if answer in count_dict:
            count_dict[answer] = count_dict[answer] + 1
        else:
            count_dict[answer] = 1
    mode_count = max(count_dict.values())
    mode = [k for k, v in count_dict.items() if v == mode_count]
    return mode[0]

In [27]:
assert get_mode(['no', 'no', 'no', 'yes']) == 'no'
assert get_mode(['no', 'no', 'yes', 'yes']) == 'no'
assert get_mode(['no', 'yes', 'yes', 'yes']) == 'yes'
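
The second assertion above shows that ties go to the label seen first. The same behavior can be sketched with `collections.Counter` from the standard library. `get_mode_counter` is a hypothetical name, and the tie-breaking relies on `most_common` ordering equal counts by first occurrence (documented for Python 3.7 and later).

```python
from collections import Counter
from typing import Hashable, List


def get_mode_counter(answers: List[Hashable]) -> Hashable:
    """Most frequent item; on a tie, the item encountered first (hypothetical variant)."""
    # most_common(1) returns a one-element list of (item, count) pairs.
    return Counter(answers).most_common(1)[0][0]


print(get_mode_counter(['no', 'no', 'yes', 'yes']))
```

Like `get_mode`, this sketch raises an exception on an empty list, so callers need to guard that case.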

<a id="get_labels"></a>
## get_labels

This function extracts the unique labels from the training data. The labels are typically found in the last column of the data, and this function identifies and returns them.

* **training_data** (List): A list representing the training data, where each sublist corresponds to a data point.

**returns**
A list of unique labels present in the last column of the training data.

In [28]:
def get_labels(training_data) :
    label_location = len(training_data[0]) - 1
    labels = []
    for data in training_data:
        label = data[label_location]
        if label not in labels : labels.append(label)
    return labels

In [29]:
test_training_data = [['round', 'large', 'blue', 'no'], ['square', 'large', 'green', 'yes'], ['square', 'small', 'blue', 'no']]
test_labels = ['no','yes']
assert get_labels(test_training_data) == test_labels
test_training_data = [['round', 'large', 'blue', 'no'],['round', 'large', 'blue', 'yes'],['round', 'large', 'blue', 'N/A']]
test_labels = ['no','yes','N/A']
assert get_labels(test_training_data) == test_labels
test_training_data = [['round', 'large', 'blue', 'yes'], ['square', 'large', 'green', 'yes'], ['square', 'small', 'blue', 'yes']]
test_labels = ['yes']
assert get_labels(test_training_data) == test_labels

<a id="get_majority_label"></a>
## get_majority_label

This function determines the majority label for a given set of training data. It counts the occurrences of each label in the last column of the data and identifies the label that occurs most frequently. If a single label dominates the entire dataset, it is considered homogeneous.

* **training_data** (List): A list representing the training data, where each sublist corresponds to a data point.
* **labels** (List): A list of unique labels present in the last column of the training data.

**returns**
A tuple containing two values:
1. A boolean indicating whether the dataset is homogeneous based on the labels.
2. If homogeneous, the single label dominating the dataset; otherwise, the label with the maximum occurrence.


In [30]:
def get_majority_label(training_data,labels) :
    label_location = len(training_data[0]) - 1
    label_map = {label: 0 for label in labels}
    num_data = len(training_data)
    for data in training_data:
        label = data[label_location]
        last_labeled = label
        label_map[label] = label_map[label] + 1
    if label_map[last_labeled] == num_data: return True,last_labeled
    max_label = max(label_map, key=lambda r: label_map[r])
    return False,max_label

In [31]:
test_training_data = [['round', 'large', 'blue', 'no'], ['square', 'large', 'green', 'yes'], ['square', 'small', 'blue', 'no']]
test_is_homogenous,test_label = get_majority_label(test_training_data,get_labels(test_training_data))
assert test_is_homogenous == False
assert test_label == "no"
test_training_data = [['round', 'large', 'blue', 'no'], ['square', 'large', 'green', 'no'], ['square', 'small', 'blue', 'no']]
test_is_homogenous,test_label = get_majority_label(test_training_data,get_labels(test_training_data))
assert test_is_homogenous == True
assert test_label == "no"

<a id="get_attr_domain"></a>
## get_attr_domain

This function extracts the unique values of a specific attribute (column) from the training data. 

* **training_data** (List): A list representing the training data, where each sublist corresponds to a data point.
* **attr_location** (int): The index representing the location of the attribute (column) for which the domain is to be obtained.

**returns**
A list containing the unique values of the specified attribute in the training data.


In [32]:
def get_attr_domain(training_data,attr_location):
    attr_domain = []
    for data in training_data:
        value = data[attr_location]
        if value not in attr_domain: attr_domain.append(value)
    return attr_domain

 

In [33]:
test_training_data = [['round', 'large', 'blue', 'no'], ['square', 'large', 'green', 'yes'], ['square', 'small', 'blue', 'no']]
assert get_attr_domain(test_training_data,0) == ["round","square"]   
assert get_attr_domain(test_training_data,1) == ["large","small"]   
assert get_attr_domain(test_training_data,2) == ["blue","green"]  

<a id="get_probabilities"></a>
## get_probabilities

This function counts, for each value of the attribute column, how many times each label occurs. Despite the name, it returns raw counts; `calculate_entropy` converts them to probabilities.

* **label_location** (int): The index representing the location of the label (answer) in each data point.
* **attr_location** (int): The index representing the location of the attribute for which probabilities are calculated.
* **training_data** (List): A list representing the training data, where each sublist corresponds to a data point.
* **labels** (List): A list of unique labels present in the label column of the training data.

**returns**
A tuple containing two dictionaries:
1. A nested dictionary (`values`) where each key is an attribute value, and the corresponding value is a dictionary representing label occurrences for that attribute value.
2. A dictionary (`sums`) where each key is an attribute value, and the corresponding value is the sum of label occurrences for that attribute value.



In [34]:
def get_probabilities(label_location,attr_location,training_data,labels):
    attr_domain = get_attr_domain(training_data,attr_location)
    values = {value: {label:0 for label in labels} for value in attr_domain}
    for data in training_data:
        attr_value = data[attr_location]
        label = data[label_location]
        value_map = values[attr_value]
        value_map[label] = value_map[label] + 1
    # Total number of records for each attribute value (avoids shadowing the built-in sum).
    sums = {value: sum(label_counts.values()) for value, label_counts in values.items()}
    return values,sums

In [35]:
test_training_data = [['round', 'large', 'blue', 'no'], ['square', 'large', 'green', 'yes'], ['square', 'small', 'blue', 'no']]
test_label_location = 3
test_attr_location = 2
test_labels = ['no','yes']
test_probabilities,test_sums = get_probabilities(test_label_location,test_attr_location,test_training_data,test_labels)
assert test_sums == {'blue':2,'green':1}
assert test_probabilities == {'blue':{'no':2,'yes':0},'green':{'no':0,'yes':1}}
test_attr_location = 1
test_probabilities,test_sums = get_probabilities(test_label_location,test_attr_location,test_training_data,test_labels)
assert test_sums == {'large':2,'small':1}
assert test_probabilities == {'large':{'no':1,'yes':1},'small':{'no':1,'yes':0}}

<a id="calculate_entropy"></a>
## calculate_entropy

This function computes the entropy of a set of probabilities given their occurrences and corresponding sums. 

* **probabilities** (Dict): A nested dictionary where each key is an attribute value, and the corresponding value is a dictionary representing label occurrences for that attribute value.
* **sums** (Dict): A dictionary where each key is an attribute value, and the corresponding value is the sum of label occurrences for that attribute value.
* **num_items** (int): The total number of items in the dataset.

**returns**
The entropy value calculated for the set of probabilities.
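
In symbols, the quantity computed here is the weighted average of the per-value entropies. The notation is introduced only for illustration: $n$ stands for `num_items`, $n_v$ for `sums[v]`, and $c_{v,\ell}$ for `probabilities[v][label]`.

$$H_{split}=\sum_{v}\frac{n_v}{n}\left(-\sum_{\ell}\frac{c_{v,\ell}}{n_v}\log_2\frac{c_{v,\ell}}{n_v}\right)$$

By the usual convention, a label with $c_{v,\ell}=0$ contributes nothing. For the color counts used in the first test of this function (blue: 3 no, 0 yes; green: 2 no, 4 yes; red: 3 no, 3 yes; $n=15$) this works out to:

$$H_{split}=\frac{3}{15}(0)+\frac{6}{15}(0.918)+\frac{6}{15}(1)\approx 0.77$$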



In [36]:
def calculate_entropy(probabilities,sums,num_items):
    total_entropy = 0
    for key in probabilities:
        entropy = 0
        probabilities_map = probabilities[key]
        for probability_key in probabilities_map:
            probability = probabilities_map[probability_key] / sums[key]
            if probability == 0:
                continue  # a zero-count label contributes nothing (0 * log2(0) is taken as 0)
            entropy = entropy + (probability * math.log2(probability))
        total_entropy = total_entropy + (-entropy * (sums[key]/num_items))
    return total_entropy

In [37]:
test_probabilities = {'blue':{'no':3,'yes':0},'green':{'no':2,'yes':4},'red':{'no':3,'yes':3}}
test_sums = {'blue':3,'green':6,'red':6}
assert round (calculate_entropy(test_probabilities,test_sums,15),2) == 0.77
test_probabilities = {'large':{'no':2,'yes':6},'small':{'no':6,'yes':1}}
test_sums = {'large':8,'small':7}
assert round (calculate_entropy(test_probabilities,test_sums,15),2) == 0.71
test_probabilities = {'round':{'no':3,'yes':4},'square':{'no':5,'yes':3}}
test_sums = {'round':7,'square':8}
assert round (calculate_entropy(test_probabilities,test_sums,15),2) == 0.97

<a id="get_base_entropy"></a>
## get_base_entropy

This function calculates the entropy of the base dataset by considering only the label occurrences. 

* **training_data** (List): A list representing the training data, where each sublist corresponds to a data point.
* **label_location** (int): The index representing the location of the label (class) in each data point.
* **labels** (List): A list of unique labels present in the label column of the training data.

**returns**
The entropy value calculated for the base dataset.



In [38]:
def get_base_entropy(training_data,label_location,labels):
    base_data = []
    BASE = "base"
    for data in training_data:
        base_data.append([BASE,data[label_location]])
    probabilities,sums = get_probabilities(1,0,base_data,labels)
    return calculate_entropy(probabilities,sums,len(training_data))

In [39]:
test_training_data = [['round', 'large', 'blue', 'no'], ['square', 'large', 'green', 'yes'], ['square', 'small', 'blue', 'no']]
test_labels = get_labels(test_training_data)
assert round(get_base_entropy(test_training_data,3,test_labels),2) == 0.92
test_training_data = [['round', 'large', 'blue', 'no'], ['square', 'large', 'green', 'no'], ['square', 'small', 'blue', 'no']]
assert round(get_base_entropy(test_training_data,3,test_labels),2) == 0
test_training_data = [['round', 'large', 'blue', 'no'], ['square', 'large', 'green', 'yes']]
assert round(get_base_entropy(test_training_data,3,test_labels),2) == 1

<a id="pick_best"></a>
## pick_best

This function identifies the best attribute to split the dataset based on the highest information gain. It calculates the information gain for each attribute and selects the one that maximizes the gain.

* **training_data** (List): A list representing the training data, where each sublist corresponds to a data point.
* **skips** (List): Indexes to not consider 
* **labels** (List) : A list of unique labels present in the label column of the training data.

**returns**
The index representing the location of the best attribute to split the dataset.


In [40]:
def pick_best(training_data,skips,labels):
    label_location = len(training_data[0]) - 1
    num_items = len(training_data)
    base_entropy = get_base_entropy(training_data,label_location,labels)
    max_attr_location,max_gain = None,-1.0 # start below any real gain so a skipped column is never returned
    for attr_location in range(label_location):
        if attr_location in skips: continue # skip column
        probabilities,sums = get_probabilities(label_location,attr_location,training_data,labels)
        entropy = calculate_entropy(probabilities,sums,num_items)
        curr_gain = base_entropy - entropy
        if curr_gain > max_gain: max_attr_location,max_gain = attr_location,curr_gain
    return max_attr_location

In [41]:
test_training_data = [['round', 'large', 'blue', 'no'], ['square', 'large', 'green', 'yes'], ['round', 'small', 'blue', 'no']]
test_labels = get_labels(test_training_data)
test_skips = []
assert pick_best(test_training_data,test_skips,test_labels) == 0
test_skips = [0]
assert pick_best(test_training_data,test_skips,test_labels) == 2
test_training_data = [['round', 'large', 'green', 'no'], ['square', 'small', 'green', 'no'], ['square', 'large', 'blue', 'yes']]
test_skips = [0,2]
assert pick_best(test_training_data,test_skips,test_labels) == 1
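
To make the selection rule concrete, the sketch below computes information gain directly from label counts. The counts are the ones used in the `calculate_entropy` tests above; `entropy` and `information_gain` are hypothetical stand-alone helpers written for this illustration, not the functions used by the pipeline.

```python
import math
from typing import Dict


def entropy(counts: Dict[str, int]) -> float:
    """Shannon entropy (base 2) of a label-count dictionary."""
    total = sum(counts.values())
    # Zero counts are skipped: they contribute nothing to the sum.
    return -sum((c / total) * math.log2(c / total) for c in counts.values() if c > 0)


def information_gain(split: Dict[str, Dict[str, int]]) -> float:
    """Entropy of the pooled labels minus the weighted entropy after the split."""
    pooled: Dict[str, int] = {}
    for counts in split.values():
        for label, c in counts.items():
            pooled[label] = pooled.get(label, 0) + c
    n = sum(pooled.values())
    remainder = sum(sum(counts.values()) / n * entropy(counts) for counts in split.values())
    return entropy(pooled) - remainder


# Label counts per attribute value, copied from the calculate_entropy tests.
splits = {
    'shape': {'round': {'no': 3, 'yes': 4}, 'square': {'no': 5, 'yes': 3}},
    'size': {'large': {'no': 2, 'yes': 6}, 'small': {'no': 6, 'yes': 1}},
    'color': {'blue': {'no': 3, 'yes': 0}, 'green': {'no': 2, 'yes': 4}, 'red': {'no': 3, 'yes': 3}},
}
gains = {name: information_gain(split) for name, split in splits.items()}
best = max(gains, key=gains.get)
print(best, {name: round(g, 2) for name, g in gains.items()})
```

On these counts `size` has the largest gain, which is consistent with `size` sitting at the root of the lecture tree.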

<a id="get_subset"></a>
## get_subset

This function extracts a subset of the training data where a specific attribute has a particular value. It filters the original dataset to include only the data points where the specified attribute matches the provided value.

* **training_data** (List): A list representing the training data, where each sublist corresponds to a data point.
* **attr** (string): The value of the attribute for which the subset is to be obtained.
* **attr_location** (int): The index representing the location of the attribute in each data point.

**returns**
A subset of the training data containing only the data points where the specified attribute matches the provided value.



In [42]:
def get_subset(training_data,attr,attr_location):
    subset = []
    for data in training_data:
        if data[attr_location] == attr: subset.append(data)
    return subset

In [43]:
test_training_data = [['round', 'large', 'blue', 'no'], ['square', 'large', 'green', 'yes'], ['round', 'small', 'blue', 'no']]
assert get_subset(test_training_data,'round',0) == [['round', 'large', 'blue', 'no'],['round', 'small', 'blue', 'no']]
assert get_subset(test_training_data,'square',0) == [['square', 'large', 'green', 'yes']]
assert get_subset(test_training_data,'green',2) == [['square', 'large', 'green', 'yes']]

<a id="id3"></a>
## id3

This function implements the ID3 (Iterative Dichotomiser 3) algorithm, a decision tree algorithm for classification. It recursively builds a decision tree based on the provided training data and attributes.

* **training_data** (List): A list representing the training data, where each sublist corresponds to a data point.
* **attribute_names** (List): A list containing the names of attributes in the training data.
* **skips** (List): Indexes of columns already used for a split on the path to this node
* **depth_limit** (int): Limits the depth of the decision tree. If 0, the function returns the most frequent label; if `None`, the tree expands until every leaf is homogeneous or no attributes remain.
* **labels** (List): A list of unique labels present in the label column of the training data.
* **default** (Any): The default label to return if the training data is empty.

**returns**
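A nested dictionary representing the decision tree. Each key is a tuple containing the attribute name, attribute index, and attribute value; each value is either a subtree (another dictionary) or, at a leaf, the predicted label. If no split is made, the label itself is returned instead of a dictionary.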


In [44]:
def id3(training_data,attribute_names,skips,labels,depth_limit,default='no'):
    if len(training_data) == 0: return default
    is_homogenous,majority_label = get_majority_label(training_data,labels)
    if is_homogenous or len(attribute_names) == len(skips): return majority_label
    if depth_limit != None:
        if depth_limit == 0: return majority_label
        depth_limit = depth_limit - 1
    best_attr_index = pick_best(training_data,skips,labels)
    best_attr_domain = get_attr_domain(training_data,best_attr_index)
    depth_map= {}
    for attr in best_attr_domain:
        subset = get_subset(training_data,attr,best_attr_index)
        next_skips = deepcopy(skips)
        next_skips.append(best_attr_index)
        node = (attribute_names[best_attr_index],best_attr_index,attr)
        depth_map[node] = id3(subset,attribute_names,next_skips,labels,depth_limit,majority_label)
    return depth_map

In [45]:
# Test Case 1: Empty Training Data
training_data_1 = []
attribute_names_1 = []
skips = []
depth_limit_1 = 2
labels_1 = ['no', 'yes']
default_1 = 'no'
assert id3(training_data_1, attribute_names_1, skips, labels_1, depth_limit_1,  default_1) == 'no'

# Test Case 2: Non-empty Training Data
training_data_2 = [['round', 'large', 'blue', 'no'],
                    ['square', 'large', 'green', 'yes'],
                    ['square', 'small', 'red', 'no'],
                    ['round', 'large', 'red', 'yes'],
                    ['square', 'small', 'blue', 'no']]
attribute_names_2 = ['shape', 'size', 'color']
skips = []
depth_limit_2 = 2

labels_2 = ['no', 'yes']
default_2 = 'no'
decision_tree_2 = id3(training_data_2, attribute_names_2, skips,labels_2, depth_limit_2, default_2)
expected_tree_2 = {('color', 2, 'blue'): 'no', ('color', 2, 'green'): 'yes', ('color', 2, 'red'): {('shape', 0, 'square'): 'no', ('shape', 0, 'round'): 'yes'}}
assert decision_tree_2 == expected_tree_2

# Test Case 3: Non-empty Training Data with Depth Limit
training_data_3 = [['round', 'large', 'blue', 'no'],
                    ['square', 'large', 'green', 'yes'],
                    ['square', 'small', 'red', 'no'],
                    ['round', 'large', 'red', 'yes'],
                    ['square', 'small', 'blue', 'no']]
attribute_names_3 = ['shape', 'size', 'color']
skips_3 = []  # no attributes have been used yet
depth_limit_3 = 0  # Depth limit set to 0
labels_3 = ['no', 'yes']
default_3 = 'no'
assert id3(training_data_3, attribute_names_3, skips_3,  labels_3, depth_limit_3, default_3) == 'no'

<a id="train"></a>
## train

This function takes training_data, attribute names, and the depth limit and returns the decision tree as a nested dictionary. If the depth limit is 0, a dictionary is not returned. Instead, the mode of the target values is returned (i.e., majority class). 

* **training_data** List[List]: The data
* **attribute_names** List: The attribute names of the data (22 for mushroom; shape, size, and color for the lecture)
* **depth_limit** int: The depth limit of the tree


**returns** 
* **dt** Dict: The trained decision tree using the ID3 algorithm (entropy, information gain). It is represented as a nested dictionary. The dictionary returned for the lecture is structured as below:

```
{
('size', 1, 'large'): 
    {('color', 2, 'blue'): 'no', 
     ('color', 2, 'green'): 'yes', 
     ('color', 2, 'red'): 
         {('shape', 0, 'round'): 'yes', 
          ('shape', 0, 'square'): 'no'}
     }, 
('size', 1, 'small'): 
     {('shape', 0, 'square'): 'no', 
      ('shape', 0, 'round'): 
          {('color', 2, 'blue'): 'no', 
           ('color', 2, 'red'): 'yes', 
           ('color', 2, 'green'): 'no'}
      }
}
```


Notice that the keys are tuples; for instance, ('size', 1, 'large') is a key. The key includes the attribute's name, column number in data, and value.
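
Because every subtree is itself a dictionary and every leaf is a plain label, the structure can be walked recursively. As a sketch, `tree_depth` below is a hypothetical helper (not part of the provided pipeline) that measures the longest chain of attribute tests in the lecture tree shown above:

```python
from typing import Any


def tree_depth(tree: Any) -> int:
    """Number of attribute tests on the longest root-to-leaf path (hypothetical helper)."""
    if not isinstance(tree, dict):
        return 0  # a leaf is just a label such as 'yes' or 'no'
    return 1 + max(tree_depth(subtree) for subtree in tree.values())


lecture_tree = {
    ('size', 1, 'large'): {
        ('color', 2, 'blue'): 'no',
        ('color', 2, 'green'): 'yes',
        ('color', 2, 'red'): {('shape', 0, 'round'): 'yes', ('shape', 0, 'square'): 'no'},
    },
    ('size', 1, 'small'): {
        ('shape', 0, 'square'): 'no',
        ('shape', 0, 'round'): {
            ('color', 2, 'blue'): 'no',
            ('color', 2, 'red'): 'yes',
            ('color', 2, 'green'): 'no',
        },
    },
}
print(tree_depth(lecture_tree))
```

A tree built with a depth limit of `d` should never report a depth greater than `d`, which makes a check like this a convenient sanity test.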


The function currently returns a hard-coded tree. Your implementation should replace this with a tree that is learned from the data using the ID3 algorithm. You do not have to assert test `train`, but it may be worthwhile to check that it can return the tree from the lecture once your implementation is in place.

In [46]:
def train(training_data, attribute_names, depth_limit=None):
    labels = get_labels(training_data)
    return id3(training_data,attribute_names,[],labels,depth_limit)

In [47]:
dt_lecture = train(training_data=train_lecture, attribute_names=attribute_names_lecture, depth_limit=0)

<a id="get_prediction"></a>
## get_prediction

This recursive function uses a decision tree represented as a nested dictionary to get a prediction for a record, which is a row of the data. 

* **record** List[]: A row of data to be predicted
* **dt** the decision tree used to make the prediction


**returns** 
A prediction ('yes' or 'no' for instance, from our Self Check example.) 

In [48]:
def get_prediction(record, dt):
    if not isinstance(dt, dict):
        return dt
    else:
        for key, value in dt.items():
            if record[key[1]]==key[2]:
                return get_prediction(record, value)

In [49]:
print(get_prediction(['round','large','blue','no'], dt=dt_lecture))
print(get_prediction(['square','large','green','yes'], dt=dt_lecture))
print(get_prediction(['square','small','red','no'], dt=dt_lecture))

no
no
no
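
One edge case is worth noting: if a record carries an attribute value that never appeared under the corresponding node during training, no key matches and `get_prediction` implicitly returns `None`. Below is a sketch of one possible guard, assuming a caller-supplied fallback label. `get_prediction_or_default` and its `default` parameter are hypothetical and not part of the provided pipeline.

```python
from typing import Any, List


def get_prediction_or_default(record: List[str], dt: Any, default: str) -> str:
    """Walk the nested-dict tree; fall back to `default` when no branch matches."""
    if not isinstance(dt, dict):
        return dt  # reached a leaf label
    for (_name, column, value), subtree in dt.items():
        if record[column] == value:
            return get_prediction_or_default(record, subtree, default)
    return default  # attribute value not seen at this node


tree = {('color', 2, 'blue'): 'no', ('color', 2, 'green'): 'yes'}
print(get_prediction_or_default(['round', 'large', 'green', 'yes'], tree, default='no'))
print(get_prediction_or_default(['round', 'large', 'red', 'yes'], tree, default='no'))
```

The second call uses the color 'red', which the small tree has no branch for, so the fallback label is returned instead of `None`.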


<a id="classify"></a>
## classify

This function takes a decision tree and a list of observations and returns a list of classifications. 

* **dt** Dict: The decision tree as a nested dictionary
* **observations** List[List]: a list of items, where each item is a row of the data


**returns** 
* **y_hat** List: A list of classifications.

In [50]:
def classify(dt, observations):
    y_hat = []
    for record in observations:
        y_hat.append(get_prediction(record, dt))   
    return y_hat

In [51]:
print(classify(dt=dt_lecture, observations=test_lecture))

['no', 'no']


<a id="evaluate"></a>
## evaluate

This function evaluates the performance of a classifier. It takes a data set (training set or test set) and the classification result (see [classify](#classify) above) and calculates the classification error rate:

$$error\_rate=\frac{errors}{n}$$ 

* **y_hat** List: A list of predictions
* **observations** List[List]: Data to be predicted (typically training or test set)


**returns** 

* **error_rate** float: The error rate.

In [52]:
def evaluate(y_hat, observations):
    errors = 0
    ground_truth = get_answers(observations)
    for index in range(len(y_hat)):
        if y_hat[index] != ground_truth[index]:
            errors = errors + 1
    return errors / (len(y_hat))

In [53]:
print(evaluate(classify(dt=dt_lecture, observations=data_lecture), observations=data_lecture))

0.4666666666666667
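
The value printed above can be checked by hand. With a depth limit of 0 the tree is the single label 'no', and 7 of the 15 lecture records are labeled 'yes', so the error rate is 7/15. A self-contained sketch of the same calculation, where `error_rate` is a hypothetical stand-in for `evaluate` that takes the ground-truth labels directly:

```python
from typing import List


def error_rate(y_hat: List[str], ground_truth: List[str]) -> float:
    """Fraction of predictions that differ from the ground truth."""
    errors = sum(1 for predicted, actual in zip(y_hat, ground_truth) if predicted != actual)
    return errors / len(y_hat)


# Target column of data_lecture, as listed in the get_answers test.
truth = ['no', 'yes', 'no', 'yes', 'no', 'no', 'yes', 'no',
         'yes', 'yes', 'no', 'yes', 'yes', 'no', 'no']
print(error_rate(['no'] * len(truth), truth))
```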


<a id="get_stats"></a>
## get_stats

This function computes the mean and the standard deviation for a given list of observations. 

* **observations** List[float]: A list of observations


**returns** (mean, standard deviation) Tuple[float,float]: tuple consisting of mean and the standard deviation

In [54]:
def get_stats(observations: List[float]) -> Tuple[float,float]:
    mean = sum(observations) / len(observations)
    variance = sum([(elem - mean)**2 for elem in observations]) / len(observations)
    std_dev = math.sqrt(variance)
    return mean, std_dev

In [55]:
assert get_stats([2, 4, 4, 4, 5, 5, 7, 9]) == (5.0, 2.0)
assert get_stats([1, 1, 1]) == (1.0, 0.0)
assert get_stats([0]) == (0.0, 0.0)
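
Note that `get_stats` divides by `len(observations)`, so it reports the population standard deviation rather than the sample one. A quick cross-check against the standard library's `statistics` module, using the data from the first assertion:

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]

population_sd = statistics.pstdev(data)  # divides by n, like get_stats
sample_sd = statistics.stdev(data)       # divides by n - 1

print(population_sd, sample_sd)
```

With only ten folds the two figures can differ noticeably, so it helps to state which one a reported number uses.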

<a id="cross_validate"></a>
## cross_validate

This function takes folds of data to `train`, `classify`, and `evaluate`.


* **folds** List[List[List]]: The original dataset partitioned into folds (see `create_folds` above)
* **attribute_names** List[str]: The feature names
* **hyperparameters** List: A list of hyperparameters to explore (depth limits for a decision tree, for instance)

**returns** 

Nothing is returned. For each hyperparameter setting, the function prints the fold number together with the training and test error rates for that fold, followed by the mean and standard deviation of those error rates across all folds. The error rates are reported as percentages.

In [56]:
def cross_validate(folds, attribute_names, hyperparameters):
    for hyperparameter in hyperparameters:
        error_list_train, error_list_test = [], []
        for fold_index in range(len(folds)):
            training_data, test_data = create_train_test(folds, fold_index)
            tree = train(training_data=training_data, attribute_names=attribute_names, depth_limit=hyperparameter)
            y_hat_train = classify(tree, training_data)
            y_hat_test = classify(tree, test_data)
            error_rate_train = evaluate(y_hat_train, training_data)
            error_rate_test = evaluate(y_hat_test, test_data)
            error_list_train.append(error_rate_train)
            error_list_test.append(error_rate_test)
            print(f"Fold: {fold_index}\tTrain Error: {error_rate_train*100:.2f}%\tTest Error: {error_rate_test*100:.2f}%")
        print(f"***")
        print(f"Depth limit: {hyperparameter}")
        print(f"\nMean(Std. Dev.) over all folds:\n-------------------------------")
        mean_train, std_train = get_stats(error_list_train)
        mean_test, std_test = get_stats(error_list_test)
        print(f"Train Error: {mean_train*100:.2f}%({std_train*100:.2f}%) Test Error: {mean_test*100:.2f}%({std_test*100:.2f}%)")
        print("\n")
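
The inner loop relies on holding out one fold as the test set and training on the rest. The sketch below illustrates that idea under stated assumptions: `split_fold` is a hypothetical stand-in, not the notebook's `create_train_test`, and the toy folds are made up.

```python
from typing import List, Tuple


def split_fold(folds: List[List[list]], test_index: int) -> Tuple[List[list], List[list]]:
    """Hold out one fold as the test set and merge the others into a training set."""
    test = folds[test_index]
    # Flatten every fold except the held-out one into a single list of records.
    train = [record
             for index, fold in enumerate(folds) if index != test_index
             for record in fold]
    return train, test


# Three toy folds with one record each (hypothetical data).
toy_folds = [[["a", 1]], [["b", 2]], [["c", 3]]]
train_set, test_set = split_fold(toy_folds, test_index=1)
print(train_set)  # [['a', 1], ['c', 3]]
print(test_set)   # [['b', 2]]
```

Because every record appears in the test set exactly once across the folds, the mean test error over all folds uses the whole data set without ever testing on a record the tree was trained on.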

In [57]:
cross_validate(folds=folds_lecture, attribute_names=attribute_names_lecture, hyperparameters=[0, 1, 2, 3, 4, 5, None])

Fold: 0	Train Error: 46.15%	Test Error: 50.00%
Fold: 1	Train Error: 46.15%	Test Error: 50.00%
Fold: 2	Train Error: 46.15%	Test Error: 100.00%
Fold: 3	Train Error: 46.15%	Test Error: 50.00%
Fold: 4	Train Error: 38.46%	Test Error: 100.00%
Fold: 5	Train Error: 50.00%	Test Error: 0.00%
Fold: 6	Train Error: 42.86%	Test Error: 100.00%
Fold: 7	Train Error: 42.86%	Test Error: 100.00%
Fold: 8	Train Error: 50.00%	Test Error: 0.00%
Fold: 9	Train Error: 50.00%	Test Error: 0.00%
***
Depth limit: 0

Mean(Std. Dev.) over all folds:
-------------------------------
Train Error: 45.88%(3.53%) Test Error: 55.00%(41.53%)


Fold: 0	Train Error: 15.38%	Test Error: 50.00%
Fold: 1	Train Error: 30.77%	Test Error: 50.00%
Fold: 2	Train Error: 23.08%	Test Error: 0.00%
Fold: 3	Train Error: 15.38%	Test Error: 50.00%
Fold: 4	Train Error: 23.08%	Test Error: 0.00%
Fold: 5	Train Error: 14.29%	Test Error: 100.00%
Fold: 6	Train Error: 21.43%	Test Error: 0.00%
Fold: 7	Train Error: 21.43%	Test Error: 0.00%
Fold: 8	Train Er

<a id="pretty_print_tree"></a>
## pretty_print_tree

This function provides a text-based representation of a decision tree that is represented as a nested dictionary. 

* **dt** Dict: The decision tree as a nested dictionary
* **tab_space** int: The indentation of the current depth level, in units of two spaces; each nested level of the tree adds 3 to this value

In [58]:
def pretty_print_tree(dt, tab_space):
    for key, value in dt.items():
        if isinstance(value, dict):
            print("  " * tab_space + str(key[0]).upper() + " - " + str(key[2]) + ": ")
            print("\n")
            pretty_print_tree(value, tab_space+3)
        else:
            print("  " * tab_space + str(key[0]).upper() + " - " + str(key[2]) + " =====> " + str(value))
            print("\n")

In [59]:
dt_lecture = train(training_data=data_lecture, attribute_names=attribute_names_lecture, depth_limit=None)
pretty_print_tree(dt_lecture, tab_space=0)

SIZE - large: 


      COLOR - blue =====> no


      COLOR - green =====> yes


      COLOR - red: 


            SHAPE - round =====> yes


            SHAPE - square =====> no


SIZE - small: 


      SHAPE - square =====> no


      SHAPE - round: 


            COLOR - blue =====> no


            COLOR - red =====> yes


            COLOR - green =====> no




<div style="background: lemonchiffon; margin:20px; padding: 20px;">
    <p>
        Let's work on the mushroom data. 
    </p>
</div>

## Classify the Mushroom data

In [60]:
folds_mushroom = create_folds(data=data_mushroom, n=10)

In [61]:
cross_validate(folds=folds_mushroom, attribute_names=attribute_names_mushroom, hyperparameters=[0, 1, 2, 3, 4, 5, None])

Fold: 0	Train Error: 48.35%	Test Error: 46.86%
Fold: 1	Train Error: 48.31%	Test Error: 47.23%
Fold: 2	Train Error: 48.16%	Test Error: 48.59%
Fold: 3	Train Error: 48.38%	Test Error: 46.62%
Fold: 4	Train Error: 48.30%	Test Error: 47.29%
Fold: 5	Train Error: 48.30%	Test Error: 47.29%
Fold: 6	Train Error: 47.93%	Test Error: 50.62%
Fold: 7	Train Error: 48.25%	Test Error: 47.78%
Fold: 8	Train Error: 48.10%	Test Error: 49.14%
Fold: 9	Train Error: 47.93%	Test Error: 50.62%
***
Depth limit: 0

Mean(Std. Dev.) over all folds:
-------------------------------
Train Error: 48.20%(0.16%) Test Error: 48.20%(1.41%)


Fold: 0	Train Error: 1.57%	Test Error: 0.62%
Fold: 1	Train Error: 1.48%	Test Error: 1.48%
Fold: 2	Train Error: 1.52%	Test Error: 1.11%
Fold: 3	Train Error: 1.44%	Test Error: 1.85%
Fold: 4	Train Error: 1.52%	Test Error: 1.11%
Fold: 5	Train Error: 1.46%	Test Error: 1.60%
Fold: 6	Train Error: 1.41%	Test Error: 2.09%
Fold: 7	Train Error: 1.49%	Test Error: 1.35%
Fold: 8	Train Error: 1.49%	Test

## Print the Mushroom Tree

In [62]:
dt_mushroom = train(training_data=data_mushroom, attribute_names=attribute_names_mushroom, depth_limit=None)
pretty_print_tree(dt_mushroom, tab_space=0)

ODOR - f =====> p


ODOR - n: 


      SPORE-PRINT-COLOR - n =====> e


      SPORE-PRINT-COLOR - k =====> e


      SPORE-PRINT-COLOR - y =====> e


      SPORE-PRINT-COLOR - w: 


            HABITAT - w =====> e


            HABITAT - g =====> e


            HABITAT - p =====> e


            HABITAT - l: 


                  CAP-COLOR - n =====> e


                  CAP-COLOR - c =====> e


                  CAP-COLOR - y =====> p


                  CAP-COLOR - w =====> p


            HABITAT - d: 


                  GILL-SIZE - n =====> p


                  GILL-SIZE - b =====> e


      SPORE-PRINT-COLOR - b =====> e


      SPORE-PRINT-COLOR - r =====> p


      SPORE-PRINT-COLOR - h =====> e


      SPORE-PRINT-COLOR - o =====> e


ODOR - c =====> p


ODOR - s =====> p


ODOR - p =====> p


ODOR - a =====> e


ODOR - y =====> p


ODOR - l =====> e


ODOR - m =====> p




## Before You Submit...

1. Re-read the general instructions provided above, and
2. Hit "Kernel"->"Restart & Run All".