# Module 8 - Programming Assignment

## Directions

1. Change the name of this file to be your JHED id as in `jsmith299.ipynb`. Be sure you use your JHED ID (it's made out of your name, unlike your student id, which is just letters and numbers).
2. Make sure the notebook you submit is cleanly and fully executed. I do not grade unexecuted notebooks.
3. Submit your notebook back in Blackboard where you downloaded this file.

*Provide the output **exactly** as requested*

In [1]:
from copy import deepcopy
import random

## Decision Trees

For this assignment you will be implementing and evaluating a Decision Tree using the ID3 Algorithm (**no** pruning or normalized information gain). Use the provided pseudocode. The data is located at (copy link):

http://archive.ics.uci.edu/ml/datasets/Mushroom

**Just in case** the UCI repository is down, which happens from time to time, I have included the data and name files on Canvas.

<div style="background: lemonchiffon; margin:20px; padding: 20px;">
    <strong>Important</strong>
    <p>
        No Pandas. The only acceptable libraries in this class are those contained in the `environment.yml`. No OOP, either. You can use Dicts, NamedTuples, etc. as your abstract data type (ADT) for the tree and nodes.
    </p>
</div>

One of the things we did not talk about in the lectures was how to deal with missing values. There are two aspects of the problem here. What do we do with missing values in the training data? What do we do with missing values when doing classification?

There are a lot of different ways that we can handle this.
A common algorithm is to use something like kNN to impute the missing values.
We can use conditional probability as well.
There are also clever modifications to the Decision Tree algorithm itself that one can make.

We're going to do something simpler, given the size of the data set: remove the observations with missing values ("?").

You must implement the following functions:

`train` takes training_data and returns the Decision Tree as a data structure.

```
def train(training_data):
   # returns the Decision Tree.
```

`classify` takes a tree produced from the function above and applies it to labeled data (like the test set) or unlabeled data (like some new data).

```
def classify(tree, observations):
    # returns a list of classifications
```

`evaluate` takes a data set with labels (like the training set or test set) and the classification result and calculates the classification error rate:

$$error\_rate=\frac{errors}{n}$$

Do not use anything else as the evaluation metric or the submission will be deemed incomplete, i.e., an "F". (Hint: the accuracy rate is not the error rate!)

`cross_validate` takes the data and uses 5x2 fold cross validation (from Module 2!) to `train`, `classify`, and `evaluate`. **Remember to shuffle your data before you create your folds**. I leave the exact signature of `cross_validate` to you but you should write it so that you can use it with *any* `classify` function of the same form (using higher order functions and partial application).

Following Module 2's material (course notes), `cross_validate` should print out a table in exactly the same format. What you are looking for here is a consistent evaluation metric across the folds. Print the error rate to 4 decimal places. **Do not convert to a percentage.**
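
As a rough sketch of how 5x2 cross validation with higher-order functions and partial application might be wired together, assuming that "5x2" means five repetitions of a freshly shuffled two-way split in which each half serves once as the training set and once as the test set. The names `cross_validate_5x2`, `train_fn`, `classify_fn`, and `evaluate_fn` are illustrative, the majority-class learner is only a stand-in for a decision tree, and the printed rows are not the Module 2 table format.

```python
import random
from collections import Counter
from functools import partial


def cross_validate_5x2(data, train_fn, classify_fn, evaluate_fn, rounds=5, seed=None):
    """Hypothetical helper: `rounds` repetitions of 2-fold cross validation."""
    rng = random.Random(seed)
    error_rates = []
    for r in range(rounds):
        shuffled = list(data)      # copy so the caller's list is not reordered
        rng.shuffle(shuffled)      # reshuffle before every two-way split
        half = len(shuffled) // 2
        folds = [shuffled[:half], shuffled[half:]]
        for test_index in (0, 1):
            test = folds[test_index]
            training = folds[1 - test_index]
            model = train_fn(training)
            predicted = classify_fn(model, test)
            rate = evaluate_fn(test, predicted)
            error_rates.append(rate)
            print(f"Round {r + 1}, Fold {test_index + 1}, Error rate: {rate:.4f}")
    return error_rates


# Stand-in learner (an assumption for this demo): always predict the majority label.
def train_majority(rows, target_index=0):
    return Counter(row[target_index] for row in rows).most_common(1)[0][0]


def classify_majority(model, rows):
    return [model for _ in rows]


def error_rate(rows, predicted, target_index=0):
    errors = sum(1 for row, p in zip(rows, predicted) if row[target_index] != p)
    return errors / len(rows)


toy = [["e", "x"]] * 6 + [["p", "y"]] * 4  # made-up data
rates = cross_validate_5x2(
    toy,
    partial(train_majority, target_index=0),  # partial application fixes the label column
    classify_majority,
    partial(error_rate, target_index=0),
    seed=0,
)
```

Because the learner, classifier, and metric are passed in as functions, the same loop could be reused with any `train`/`classify` pair that follows the same calling convention.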

```
def pretty_print_tree(tree):
    # pretty prints the tree
```

This should be a text representation of a decision tree trained on the entire data set (no train/test).

To summarize...

Apply the Decision Tree algorithm to the Mushroom data set using 5x2 cross validation and the error rate as the evaluation metric. When you are done, apply the Decision Tree algorithm to the entire data set and print out the resulting tree.

**Note** Because this assignment has a natural recursive implementation, you should consider using `deepcopy` at the appropriate places.


### Provided Functions

With n fold cross validation, we divide our data set into n subgroups called "folds" and then use those folds for training and testing. You pick n based on the size of your data set. If you have a small data set--100 observations--and you used n=10, each fold would only have 10 observations. That's probably too small. You want at least 30. At the other extreme, we generally don't use n > 10.

With 1,030 observations, n = 10 is fine so we will have 10 folds. create_folds will take a list (xs) and split it into n equal folds with each fold containing one-tenth of the observations.

You do not need to document these. 

You can use this function to read the data file.

In [2]:
def parse_data(file_name: str) -> list[list]:
    data = []
    file = open(file_name, "r")
    for line in file:
        datum = line.rstrip().split(",")
        data.append(datum)
    random.shuffle(data)
    return data

You can use this function to create 10 folds for 5x2 cross validation.

In [3]:
def create_folds(xs: list, n: int) -> list[list[list]]:
    k, m = divmod(len(xs), n)
    # be careful of generators...
    return list(xs[i * k + min(i, m):(i + 1) * k + min(i + 1, m)] for i in range(n))
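
As a quick check of the slicing arithmetic above: `divmod` yields the base fold size `k` and the remainder `m`, and the first `m` folds each receive one extra item. The function is repeated here only so that the snippet runs on its own, and the 23-item list is made up.

```python
def create_folds(xs: list, n: int) -> list[list]:
    k, m = divmod(len(xs), n)
    return [xs[i * k + min(i, m):(i + 1) * k + min(i + 1, m)] for i in range(n)]


xs = list(range(23))
folds = create_folds(xs, 5)
sizes = [len(fold) for fold in folds]
print(sizes)  # [5, 5, 5, 4, 4]
```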

We always use one of the n folds as a test set and the remaining folds as a training set.
We need a function that'll take our n folds and return the train and test sets:

In [4]:
def create_train_test(folds: list[list[list]], index: int) -> tuple[list[list], list[list]]:
    training = []
    test = []
    for i, fold in enumerate(folds):
        if i == index:
            test = fold
        else:
            training = training + fold
    return training, test

Put your code after this line:

-----

In [5]:
# Load the dataset
data = parse_data("agaricus-lepiota.data")

# Check for any missing values 
missing_values_count = sum(1 for row in data if '?' in row)

len(data), missing_values_count  

(8124, 2480)

In [6]:
# Filter out rows with missing values
cleaned_data = [row for row in data if '?' not in row]

# Check the length 
len(cleaned_data)  


5644

## calculate_entropy Documentation

Calculates the entropy of the target variable in the dataset, which is a measure of the dataset's disorder or uncertainty.

Parameters:
- data (list): The dataset for which the entropy is calculated.
- target_index (int): The index of the target feature (class label).

Returns:
- float: The entropy value of the target variable, representing the uncertainty or impurity within the data.
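
For reference, the Shannon entropy of the class labels is $H = -\sum_i p_i \log_2 p_i$, where $p_i$ is the fraction of rows with label $i$. The sketch below computes the same quantity with only the standard library, in case `numpy` or `scipy` turn out not to be in the course environment (an assumption, since `environment.yml` is not shown here). The name `entropy_stdlib` is illustrative.

```python
from collections import Counter
from math import log2


def entropy_stdlib(data, target_index):
    # Count how often each label occurs in the target column
    counts = Counter(row[target_index] for row in data)
    total = sum(counts.values())
    # Sum -p * log2(p) over the observed labels
    return sum(-(c / total) * log2(c / total) for c in counts.values())


mixed = [["e"], ["e"], ["p"], ["p"]]
pure = [["e"], ["e"], ["e"]]
print(entropy_stdlib(mixed, 0))  # 1.0
print(entropy_stdlib(pure, 0))   # 0.0
```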


In [7]:
import numpy as np
from scipy.stats import entropy

def calculate_entropy(data, target_index):
    labels, counts = np.unique([row[target_index] for row in data], return_counts=True)
    probabilities = counts / counts.sum()
    return entropy(probabilities, base=2)

In [8]:
data = [['e'], ['e'], ['p'], ['p']]

# Test if entropy of a mixed dataset is greater than 0
assert calculate_entropy(data, 0) > 0, "Test failed: Entropy for mixed data should be greater than 0"
print("Test 1 passed: Entropy for mixed data is greater than 0.")

# Test if entropy of a pure dataset is 0
pure_data = [['e'], ['e'], ['e']]
assert calculate_entropy(pure_data, 0) == 0, "Test failed: Entropy for pure data should be 0"
print("Test 2 passed: Entropy for pure data is 0.")

# Test if entropy is calculated for single feature
assert isinstance(calculate_entropy(data, 0), float), "Test failed: Entropy should be a float value"
print("Test 3 passed: Entropy is calculated as a float.")


Test 1 passed: Entropy for mixed data is greater than 0.
Test 2 passed: Entropy for pure data is 0.
Test 3 passed: Entropy is calculated as a float.


## split_data Documentation

Splits the dataset based on a given feature index, partitioning the dataset into groups according to unique feature values.

Parameters:
- data (list): The dataset to be split.
- feature_index (int): The index of the feature to split on.

Returns:
- dict: A dictionary where the keys are unique feature values, and the values are the corresponding subsets of the dataset.


In [9]:
def split_data(data, feature_index):
    unique_values = np.unique([row[feature_index] for row in data])
    splits = {value: [] for value in unique_values}
    for row in data:
        splits[row[feature_index]].append(row)
    return splits

In [10]:
data = [['e', 'round'], ['p', 'square'], ['e', 'round'], ['p', 'square']]

# Test if data is split into correct categories
splits = split_data(data, 1)
assert 'round' in splits and 'square' in splits, "Test failed: Splits do not contain expected categories"
print("Test 1 passed: Data is split into correct categories.")

# Test if the number of splits is correct
assert len(splits) == 2, "Test failed: Number of splits should match unique feature values"
print("Test 2 passed: Number of splits is correct.")

# Test if each split contains appropriate number of rows
assert len(splits['round']) == 2, "Test failed: 'round' category should have 2 rows"
print("Test 3 passed: 'round' category has the correct number of rows.")



Test 1 passed: Data is split into correct categories.
Test 2 passed: Number of splits is correct.
Test 3 passed: 'round' category has the correct number of rows.


## best_split Documentation

Finds the best feature to split the dataset by calculating information gain from all features and selecting the one with the highest gain.

Parameters:
- data (list): The dataset to be evaluated.
- target_index (int): The index of the target feature (class label).

Returns:
- int: The index of the feature that results in the highest information gain when used to split the dataset.
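
Information gain for a feature is the entropy of the parent set minus the size-weighted entropy of the subsets that the feature produces. A small self-contained sketch on made-up data follows; the helper names `label_entropy` and `information_gain` are hypothetical and not part of the notebook.

```python
from collections import Counter, defaultdict
from math import log2


def label_entropy(rows, target_index=0):
    counts = Counter(row[target_index] for row in rows)
    total = len(rows)
    return sum(-(c / total) * log2(c / total) for c in counts.values())


def information_gain(rows, feature_index, target_index=0):
    # Group the rows by their value for the chosen feature
    groups = defaultdict(list)
    for row in rows:
        groups[row[feature_index]].append(row)
    # Weighted entropy that remains after the split
    remainder = sum(len(g) / len(rows) * label_entropy(g, target_index) for g in groups.values())
    return label_entropy(rows, target_index) - remainder


toy = [
    ["e", "round", "small"],
    ["e", "round", "large"],
    ["p", "square", "small"],
    ["p", "square", "large"],
]
gain_shape = information_gain(toy, 1)  # shape separates the labels perfectly
gain_size = information_gain(toy, 2)   # size says nothing about the label
print(gain_shape, gain_size)  # 1.0 0.0
```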


In [11]:
def best_split(data, target_index):
    base_entropy = calculate_entropy(data, target_index)
    best_gain = 0
    best_feature = None  # stays None if no feature has positive information gain

    for i in range(len(data[0])):
        if i == target_index:
            continue  # Skip the target feature
        splits = split_data(data, i)
        # Entropy after the split, weighted by the share of rows in each subset
        weighted_entropy = sum((len(subset) / len(data)) * calculate_entropy(subset, target_index) for subset in splits.values())
        gain = base_entropy - weighted_entropy

        # Strict comparison keeps the lowest-indexed feature on ties
        if gain > best_gain:
            best_gain = gain
            best_feature = i

    return best_feature


In [12]:
data = [['e', 'round', 'small'], ['p', 'square', 'large'], ['e', 'round', 'small'], ['p', 'square', 'large']]

# Test if the function returns a valid feature index
best_feature = best_split(data, 0)
assert isinstance(best_feature, int), "Test failed: The best feature index should be an integer"
print("Test 1 passed: The best feature index is an integer.")

# Test if the returned feature index is within the correct range
assert 1 <= best_feature < len(data[0]), "Test failed: Feature index should be a non-target column index"
print("Test 2 passed: The feature index is within valid range.")

# Test if the selected best feature is either Feature 1 (shape) or Feature 2 (size), based on the dataset used
assert best_feature in [1, 2], "Test failed: The selected best feature should be either Feature 1 or Feature 2"
print("Test 3 passed: The best feature is either Feature 1 or Feature 2.")



Test 1 passed: The best feature index is an integer.
Test 2 passed: The feature index is within valid range.
Test 3 passed: The best feature is either Feature 1 or Feature 2.


## train Documentation

Recursively builds a decision tree using the ID3 algorithm based on entropy and information gain. The tree is represented as a nested dictionary.

Parameters:
- data (list): The dataset used to train the decision tree.
- target_index (int, optional): The index of the target feature (default is 0).

Returns:
- dict or str: A dictionary representing the decision tree, where internal nodes represent feature splits, or a string representing a class label if the node is a leaf.


In [13]:
def train(data, target_index=0):
    labels = [row[target_index] for row in data]
    # Most common label; sorting first makes tie-breaking deterministic
    majority = max(sorted(set(labels)), key=labels.count)

    # Base case: every row has the same label
    if len(set(labels)) == 1:
        return labels[0]
    # Base case: only the target column remains, so fall back to the majority label
    if len(data[0]) == 1:
        return majority

    # Determine the best feature to split, excluding the target
    best_feature = best_split(data, target_index)
    if best_feature is None:  # no feature has positive information gain
        return majority
    tree = {best_feature: {}}
    splits = split_data(data, best_feature)

    for feature_value, subset in splits.items():
        # Drop the chosen column before recursing; columns after it shift down by one
        subtree = train([row[:best_feature] + row[best_feature + 1:] for row in subset], target_index)
        tree[best_feature][feature_value] = subtree

    return tree


In [14]:
data = [['e', 'round'], ['p', 'square'], ['e', 'round'], ['p', 'square']]

# Test if the tree is trained without errors
tree = train(data)
assert isinstance(tree, dict), "Test failed: The tree should be a dictionary"
print("Test 1 passed: Tree is trained successfully and is a dictionary.")

# Test if the tree contains a valid feature as the root (any feature index within range)
# Ensure the tree's root feature is a valid feature (not the target class label)
root_feature = list(tree.keys())[0]
assert 1 <= root_feature < len(data[0]), "Test failed: The root should split on a valid feature, excluding the class label"
print("Test 2 passed: Tree contains a valid root feature that is not the class label.")


root_feature = list(tree.keys())[0]

# Test if the tree contains valid leaf nodes (leaves should be strings)
assert all(isinstance(val, str) for val in tree[root_feature].values()), "Test failed: Leaf nodes should be strings"
print("Test 3 passed: Leaf nodes are valid.")



Test 1 passed: Tree is trained successfully and is a dictionary.
Test 2 passed: Tree contains a valid root feature that is not the class label.
Test 3 passed: Leaf nodes are valid.


## classify Documentation

Classifies a given observation using the decision tree generated by the `train` function.

Parameters:
- tree (dict): The decision tree used for classification.
- observation (list): A single observation (data point) to classify.

Returns:
- str: The predicted class label for the observation.


In [15]:
def classify(tree, observation):
    if not isinstance(tree, dict):
        return tree  # Leaf node
    feature = list(tree.keys())[0]
    feature_value = observation[feature]
    if feature_value in tree[feature]:
        return classify(tree[feature][feature_value], observation)
    else:
        return None 


In [16]:
tree = {1: {'round': 'e', 'square': 'p'}}
observation = ['e', 'round']

# Test if the classifier returns the correct label
assert classify(tree, observation) == 'e', "Test failed: Classification should return 'e' for round"
print("Test 1 passed: Classification returns 'e' for round.")

# Test classification of a different observation
observation_2 = ['p', 'square']
assert classify(tree, observation_2) == 'p', "Test failed: Classification should return 'p' for square"
print("Test 2 passed: Classification returns 'p' for square.")

# Test if the function returns None for missing branch
observation_3 = ['p', 'triangle']
assert classify(tree, observation_3) is None, "Test failed: Classification should return None for unknown category"
print("Test 3 passed: Classification returns None for unknown category.")


Test 1 passed: Classification returns 'e' for round.
Test 2 passed: Classification returns 'p' for square.
Test 3 passed: Classification returns None for unknown category.
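
The assignment's `classify(tree, observations)` signature takes a list of observations and returns a list of labels, whereas the function above handles one observation at a time. One way to bridge the two, sketched here with the hypothetical names `classify_one` and `classify_all`, is a thin wrapper around a single-observation walk of the nested dictionary:

```python
def classify_one(tree, observation):
    # Walk the nested-dict tree until a leaf (a non-dict value) is reached.
    while isinstance(tree, dict):
        feature = next(iter(tree))
        branches = tree[feature]
        value = observation[feature]
        if value not in branches:
            return None  # unseen feature value: there is no branch to follow
        tree = branches[value]
    return tree


def classify_all(tree, observations):
    return [classify_one(tree, obs) for obs in observations]


tree = {1: {"round": "e", "square": "p"}}
predictions = classify_all(tree, [["e", "round"], ["p", "square"], ["p", "triangle"]])
print(predictions)  # ['e', 'p', None]
```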


## evaluate Documentation

Calculates the classification error rate, which is the proportion of misclassified instances.

Parameters:
- actual (list): The actual class labels.
- predicted (list): The predicted class labels.

Returns:
- float: The error rate, calculated as the ratio of misclassified instances to the total number of instances.


In [17]:
def evaluate(actual, predicted):
    errors = sum(1 for a, p in zip(actual, predicted) if a != p)
    return errors / len(actual)

In [18]:
actual = ['e', 'p', 'e', 'p']
predicted = ['e', 'p', 'p', 'p']

# Test if the error rate is calculated correctly
assert evaluate(actual, predicted) == 0.25, "Test failed: Error rate should be 0.25"
print("Test 1 passed: Error rate is calculated correctly.")

# Test if error rate is 0 for perfect prediction
predicted_perfect = ['e', 'p', 'e', 'p']
assert evaluate(actual, predicted_perfect) == 0.0, "Test failed: Error rate should be 0 for perfect prediction"
print("Test 2 passed: Error rate is 0 for perfect prediction.")

# Test if the error rate is a float
assert isinstance(evaluate(actual, predicted), float), "Test failed: Error rate should be a float"
print("Test 3 passed: Error rate is a float.")


Test 1 passed: Error rate is calculated correctly.
Test 2 passed: Error rate is 0 for perfect prediction.
Test 3 passed: Error rate is a float.


## pretty_print_tree Documentation

Recursively prints the decision tree in a human-readable format, showing each feature and the corresponding split.

Parameters:
- tree (dict): The decision tree to be printed.
- depth (int, optional): The depth of the current node, used for indentation (default is 0).


In [19]:
def pretty_print_tree(tree, depth=0):
    if not isinstance(tree, dict):
        print("  " * depth + f"Leaf: {tree}")
    else:
        feature = list(tree.keys())[0]  # Get the feature being split on
        for feature_value, subtree in tree[feature].items():
            print("  " * depth + f"Feature {feature} = {feature_value}:")  # Print the current node
            pretty_print_tree(subtree, depth + 1)  # Recursively print subtrees



In [20]:
tree = {1: {'round': 'e', 'square': 'p'}}

# Test if tree is printed without errors
try:
    pretty_print_tree(tree)
    print("Test 1 passed: Tree printed successfully.")
except Exception as e:
    assert False, f"Test failed: pretty_print_tree raised an exception: {e}"

# Test if the tree has the expected number of branches
assert len(tree[1]) == 2, "Test failed: Tree should have two branches for 'round' and 'square'"
print("Test 2 passed: Tree has correct branches.")

# Test if the leaves are strings
assert all(isinstance(val, str) for val in tree[1].values()), "Test failed: Tree leaves should be strings"
print("Test 3 passed: Tree leaves are strings.")


Feature 1 = round:
  Leaf: e
Feature 1 = square:
  Leaf: p
Test 1 passed: Tree printed successfully.
Test 2 passed: Tree has correct branches.
Test 3 passed: Tree leaves are strings.


## create_folds Documentation

Splits the dataset into `n` contiguous folds for cross-validation. The rows are not shuffled here, so the data should already be shuffled (as `parse_data` does) before calling this function. Any leftover rows are added to the last fold.

Parameters:
- xs (list): The dataset to be split into folds.
- n (int): The number of folds to create.

Returns:
- list: A list of folds, where each fold is a subset of the dataset.


In [21]:
def create_folds(xs: list, n: int) -> list[list]:
    fold_size = len(xs) // n
    # Slice consecutive chunks of fold_size rows (slicing copies, so xs is untouched)
    folds = [xs[i * fold_size:(i + 1) * fold_size] for i in range(n)]
    if len(xs) % n != 0:
        # Put the leftover rows in the last fold
        remainder = xs[n * fold_size:]
        folds[-1].extend(remainder)
    return folds

In [22]:
data = [['e'], ['p'], ['e'], ['p'], ['e']]

# Test if the correct number of folds is created
folds = create_folds(data, 2)
assert len(folds) == 2, "Test failed: There should be 2 folds"
print("Test 1 passed: Correct number of folds created.")

# Test if the data is evenly distributed in folds
assert len(folds[0]) == 2 and len(folds[1]) == 3, "Test failed: Data should be split into correct fold sizes"
print("Test 2 passed: Data is evenly distributed in folds.")

# Test if folds contain the expected data
assert all(isinstance(fold, list) for fold in folds), "Test failed: Folds should contain lists"
print("Test 3 passed: Folds contain lists.")


Test 1 passed: Correct number of folds created.
Test 2 passed: Data is evenly distributed in folds.
Test 3 passed: Folds contain lists.


## create_train_test Documentation

Combines folds into a training set and isolates one fold as the test set for cross-validation.

Parameters:
- folds (list): The list of folds generated from the dataset.
- index (int): The index of the fold to be used as the test set.

Returns:
- tuple: A tuple containing the training set (list) and the test set (list).


In [23]:
def create_train_test(folds: list[list], index: int) -> tuple[list, list]:
    test = folds[index]
    train = [item for i, fold in enumerate(folds) if i != index for item in fold]
    return train, test

In [24]:
folds = [[['e'], ['p']], [['e'], ['p'], ['e']]]

# Test if train and test sets are created correctly
training, test = create_train_test(folds, 1)

# Test 1: Check if the training set has the correct number of rows
assert len(training) == 2, "Test failed: Train set should have 2 rows"
print("Test 1 passed: Train set has the correct number of rows.")

# Test 2: Check if the test set has the correct number of rows
assert len(test) == 3, "Test failed: Test set should have 3 rows"
print("Test 2 passed: Test set has the correct number of rows.")

# Test 3: Check if train and test sets contain valid data (lists)
assert all(isinstance(row, list) for row in training + test), "Test failed: Train and test sets should contain lists"
print("Test 3 passed: Train and test sets contain valid data.")


Test 1 passed: Train set has the correct number of rows.
Test 2 passed: Test set has the correct number of rows.
Test 3 passed: Train and test sets contain valid data.


## cross_validate Documentation

Performs k-fold cross-validation on the dataset using the decision tree classifier and returns the error rate for each fold.

Parameters:
- data (list): The dataset to be used for cross-validation.
- folds (int, optional): The number of folds to use for cross-validation (default is 5).
- target_index (int, optional): The index of the target feature (default is 0).

Returns:
- list: A list of error rates for each fold.


In [25]:
def cross_validate(data, folds=5, target_index=0):
    fold_data = create_folds(data, folds)
    error_rates = []
    
    for i in range(folds):
        train_data, test_data = create_train_test(fold_data, i)
        tree = train(train_data, target_index)
        actual = [row[target_index] for row in test_data]
        predictions = [classify(tree, row) for row in test_data]
        error_rate = evaluate(actual, predictions)
        error_rates.append(error_rate)
        print(f"Fold {i + 1}, Error rate: {error_rate:.4f}")

    return error_rates


In [26]:
# Run cross-validation on a sample dataset
data = [['e', 'round'], ['p', 'square'], ['e', 'round'], ['p', 'square']]

# Test 1: Check if cross-validation returns a list of error rates
error_rates = cross_validate(data, folds=2)
assert isinstance(error_rates, list), "Test failed: cross_validate should return a list of error rates"
print("Test 1 passed: cross_validate returns a list of error rates.")

# Test 2: Check if the number of error rates matches the number of folds
assert len(error_rates) == 2, "Test failed: The number of error rates should match the number of folds"
print("Test 2 passed: The number of error rates matches the number of folds.")

# Test 3: Check if each error rate is a float
assert all(isinstance(rate, float) for rate in error_rates), "Test failed: Each error rate should be a float"
print("Test 3 passed: Each error rate is a float.")


Fold 1, Error rate: 0.0000
Fold 2, Error rate: 0.0000
Test 1 passed: cross_validate returns a list of error rates.
Test 2 passed: The number of error rates matches the number of folds.
Test 3 passed: Each error rate is a float.


In [27]:
cross_validate(cleaned_data, folds=5)

final_tree = train(cleaned_data)
pretty_print_tree(final_tree)

Fold 1, Error rate: 0.4973
Fold 2, Error rate: 0.4982
Fold 3, Error rate: 0.4973
Fold 4, Error rate: 0.4938
Fold 5, Error rate: 0.4726
Feature 5 = a:
  Leaf: e
Feature 5 = c:
  Leaf: p
Feature 5 = f:
  Leaf: p
Feature 5 = l:
  Leaf: e
Feature 5 = m:
  Leaf: p
Feature 5 = n:
  Feature 19 = k:
    Leaf: e
  Feature 19 = n:
    Leaf: e
  Feature 19 = r:
    Leaf: p
  Feature 19 = w:
    Feature 3 = c:
      Leaf: e
    Feature 3 = g:
      Leaf: e
    Feature 3 = n:
      Leaf: e
    Feature 3 = p:
      Leaf: e
    Feature 3 = w:
      Leaf: p
    Feature 3 = y:
      Leaf: p
Feature 5 = p:
  Leaf: p


I fixed the problem of including the class label as a feature, but my model has low predictive power and I don't know why. It's almost as likely to be incorrect as correct.
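
One possible explanation, offered as a hypothesis rather than a confirmed diagnosis: `train` deletes the chosen column before recursing, so feature indices stored deeper in the tree refer to positions in the shortened rows, while `classify` looks them up in the full-length observation. Any feature that sat after a removed column would then be read from the wrong position. The sketch below keeps every row intact and instead passes along the list of still-unused feature indices. The names `train_keep_columns`, `majority_label`, and `predict` are hypothetical, and the toy data is made up so that the second split uses a higher column index than the first.

```python
from collections import Counter, defaultdict
from math import log2


def label_entropy(rows, target_index):
    counts = Counter(row[target_index] for row in rows)
    return sum(-(c / len(rows)) * log2(c / len(rows)) for c in counts.values())


def split_rows(rows, feature):
    groups = defaultdict(list)
    for row in rows:
        groups[row[feature]].append(row)
    return groups


def majority_label(rows, target_index):
    counts = Counter(row[target_index] for row in rows)
    # Highest count wins; ties are broken alphabetically for determinism
    return min(counts, key=lambda label: (-counts[label], label))


def train_keep_columns(rows, target_index=0, features=None):
    if features is None:
        features = [i for i in range(len(rows[0])) if i != target_index]
    labels = {row[target_index] for row in rows}
    if len(labels) == 1:
        return labels.pop()
    if not features:
        return majority_label(rows, target_index)

    def gain(feature):
        groups = split_rows(rows, feature).values()
        remainder = sum(len(g) / len(rows) * label_entropy(g, target_index) for g in groups)
        return label_entropy(rows, target_index) - remainder

    best = max(features, key=gain)
    remaining = [f for f in features if f != best]  # rows keep all their columns
    return {best: {value: train_keep_columns(subset, target_index, remaining)
                   for value, subset in split_rows(rows, best).items()}}


def predict(tree, observation):
    while isinstance(tree, dict):
        feature = next(iter(tree))
        tree = tree[feature].get(observation[feature])  # None for unseen values
    return tree


toy = [
    ["e", "s", "x", "a"],
    ["e", "t", "y", "a"],
    ["e", "s", "x", "b"],
    ["e", "s", "y", "b"],
    ["p", "t", "x", "b"],
    ["p", "t", "y", "b"],
]
toy_tree = train_keep_columns(toy)
print(toy_tree)  # {1: {'s': 'e', 't': {3: {'a': 'e', 'b': 'p'}}}}
toy_errors = sum(1 for row in toy if predict(toy_tree, row) != row[0])
```

In this toy tree the inner node is stored as column 3, which is where the value lives in the original rows. With column deletion the same node would be stored as column 2 and looked up in the wrong place at classification time. Whether this accounts for the error rates above would need to be checked by re-running the notebook.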

## Before You Submit...

1. Did you provide output exactly as requested?
2. Did you re-execute the entire notebook? ("Restart Kernel and Run All Cells...")
3. If you did not complete the assignment or had difficulty please explain what gave you the most difficulty in the Markdown cell below.
4. Did you change the name of the file to `jhed_id.ipynb`?

Do not submit any other files.