# Module 9 - Programming Assignment

## Directions

1. Change the name of this file to be your JHED ID, as in `jsmith299.ipynb`. Be sure to use your JHED ID (it's made out of your name), not your student ID (which is just letters and numbers).
2. Make sure the notebook you submit is cleanly and fully executed. I do not grade unexecuted notebooks.
3. Submit your notebook back in Blackboard where you downloaded this file.

*Provide the output **exactly** as requested*

## Naive Bayes Classifier

For this assignment you will be implementing and evaluating a Naive Bayes Classifier with the same data from last week:

http://archive.ics.uci.edu/ml/datasets/Mushroom

(You should have downloaded it).

<div style="background: lemonchiffon; margin:20px; padding: 20px;">
    <strong>Important</strong>
    <p>
        No Pandas. The only acceptable libraries in this class are those contained in the `environment.yml`. No OOP, either. You can use Dicts, NamedTuples, etc. as your abstract data type (ADT) for the classifier.
    </p>
</div>


You'll first need to calculate all of the necessary probabilities using a `train` function. A flag will control whether or not you use "+1 Smoothing" or not. You'll then need to have a `classify` function that takes your probabilities, a List of instances (possibly a list of 1) and returns a List of Tuples. Each Tuple has the best class in the first position and a dict with a key for every possible class label and the associated *normalized* probability. For example, if we have given the `classify` function a list of 2 observations, we would get the following back:

```
[("e", {"e": 0.98, "p": 0.02}), ("p", {"e": 0.34, "p": 0.66})]
```

When calculating the error rate of your classifier, you should pick the class label with the highest probability; you can write a simple function that takes the Dict and returns that class label.
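
The helper mentioned above can be sketched as follows. The name `best_label` is hypothetical, not something the assignment prescribes; the sample data is the example output shown earlier.

```python
from typing import Dict, List, Tuple

def best_label(probabilities: Dict[str, float]) -> str:
    """Return the class label whose probability is highest."""
    return max(probabilities, key=probabilities.get)

# The example classify output from above.
results: List[Tuple[str, Dict[str, float]]] = [
    ("e", {"e": 0.98, "p": 0.02}),
    ("p", {"e": 0.34, "p": 0.66}),
]
labels = [best_label(probabilities) for _, probabilities in results]
print(labels)
```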

As a reminder, the Naive Bayes Classifier generates the *unnormalized* probabilities from the numerator of Bayes Rule:

$$P(C|A) \propto P(A|C)P(C)$$

where C is the class and A are the attributes (data). Since the normalizer of Bayes Rule is the *sum* of all possible numerators and you have to calculate them all, the normalizer is just the sum of the unnormalized probabilities.
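
As a small illustration of that normalization step, here is a sketch that divides each class's numerator by the sum of all numerators. The function name `normalize` and the numbers are made up for the example.

```python
from typing import Dict

def normalize(unnormalized: Dict[str, float]) -> Dict[str, float]:
    """Divide each numerator P(A|C)P(C) by the sum of all numerators."""
    total = sum(unnormalized.values())
    return {label: value / total for label, value in unnormalized.items()}

# Made-up numerators for two classes, for illustration only.
scores = {"e": 0.003, "p": 0.001}
posterior = normalize(scores)
print(posterior)
```

After normalization the values sum to 1, so they can be read as the class probabilities for that observation.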

You will have the same basic functions as the last module's assignment and some of them can be reused or at least repurposed.

`train` takes training_data and returns a Naive Bayes Classifier (NBC) as a data structure. There are many options including namedtuples and just plain old nested dictionaries. **No OOP**.

```
def train(training_data, smoothing=True):
   # returns the Naive Bayes Classifier.
```

The `smoothing` value defaults to True. You should handle both cases.
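
One common way to handle both cases is sketched below. It assumes you know the full set of possible values (the domain) for an attribute, for example by collecting them from the whole training set; the function name and arguments are hypothetical and this is not the required design.

```python
from collections import Counter
from typing import Dict, List

def value_probabilities(values: List[str], domain: List[str],
                        smoothing: bool = True) -> Dict[str, float]:
    """Estimate P(value | class) from the values observed within one class.

    With smoothing, 1 is added to every count and the size of the domain
    is added to the denominator, so unseen values get a small non-zero
    probability. Without smoothing, unseen values get exactly 0.
    """
    counts = Counter(values)
    add = 1 if smoothing else 0
    denominator = len(values) + add * len(domain)
    return {value: (counts[value] + add) / denominator for value in domain}

observed = ["x", "x", "f"]   # made-up values seen for one class
domain = ["x", "f", "b"]     # "b" never occurs with this class
with_smoothing = value_probabilities(observed, domain, smoothing=True)
without_smoothing = value_probabilities(observed, domain, smoothing=False)
print(with_smoothing)
print(without_smoothing)
```

Note how the unseen value `"b"` gets probability 1/6 with smoothing and 0 without; a single zero wipes out the whole product for that class.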

`classify` takes an NBC produced from the function above and applies it to labeled data (like the test set) or unlabeled data (like some new data). (This is not the same `classify` as the pseudocode, which classifies only one instance at a time; it can call it, though.)

```
def classify(nbc, observations, labeled=True):
    # returns a list of tuples, the argmax and the raw data as per the pseudocode.
```

`evaluate` takes a data set with labels (like the training set or test set) and the classification result and calculates the classification error rate:

$$error\_rate=\frac{errors}{n}$$

Do not use anything else as the evaluation metric or the submission will be deemed incomplete, i.e., an "F". (Hint: accuracy rate is not the error rate!)
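
A minimal sketch of the error rate formula, assuming the true and predicted labels are already available as two parallel lists (the function name and sample labels are made up):

```python
from typing import List

def error_rate(true_labels: List[str], predicted_labels: List[str]) -> float:
    """Return errors / n, the fraction of observations classified incorrectly."""
    errors = sum(1 for true, predicted in zip(true_labels, predicted_labels)
                 if true != predicted)
    return errors / len(true_labels)

rate = error_rate(["e", "p", "e", "p"], ["e", "p", "p", "p"])
formatted = f"{rate * 100:.2f}%"  # printed as a percent
print(formatted)
```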

`cross_validate` takes the data and uses 10 fold cross validation (from Module 3!) to `train`, `classify`, and `evaluate`. **Remember to shuffle your data before you create your folds**. I leave the exact signature of `cross_validate` to you but you should write it so that you can use it with *any* `classify` function of the same form (using higher order functions and partial application). If you did so last time, you can reuse it for this assignment.

Following Module 3's discussion, `cross_validate` should print out the fold number and the evaluation metric (error rate) for each fold and then the average value (and the variance). What you are looking for here is a consistent evaluation metric across the folds. You should print the error rates in terms of percents (i.e., multiply the error rate by 100 and add "%" to the end).
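
The sketch below shows one possible shape for such a function, using higher order functions and partial application. Every name in it is hypothetical, and the majority-class "model" is only a toy stand-in so the example runs on its own; it is not a Naive Bayes Classifier.

```python
import random
from collections import Counter
from functools import partial
from typing import Callable, List, Sequence, Tuple

Row = List[str]

def create_folds(rows: Sequence[Row], k: int) -> List[List[Row]]:
    """Deal rows round-robin into k folds so every row lands in exactly one fold."""
    return [list(rows[i::k]) for i in range(k)]

def split(folds: List[List[Row]], index: int) -> Tuple[List[Row], List[Row]]:
    """Use fold `index` as the test set and all other folds as the training set."""
    train_rows = [row for i, fold in enumerate(folds) if i != index for row in fold]
    return train_rows, folds[index]

def error_rate(rows: List[Row], predictions: List[str]) -> float:
    """Fraction of rows whose label (first element) differs from the prediction."""
    errors = sum(1 for row, predicted in zip(rows, predictions) if row[0] != predicted)
    return errors / len(rows)

def cross_validate_sketch(rows: Sequence[Row], k: int,
                          train_fn: Callable, classify_fn: Callable,
                          seed: int = 0) -> List[float]:
    """Shuffle a copy of the data, then train and test once per fold."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    folds = create_folds(shuffled, k)
    rates = []
    for index in range(k):
        train_rows, test_rows = split(folds, index)
        model = train_fn(train_rows)
        rates.append(error_rate(test_rows, classify_fn(model, test_rows)))
    return rates

# Toy stand-ins: the "model" is just the most common training label.
def majority_train(rows: List[Row], default: str) -> str:
    counts = Counter(row[0] for row in rows)
    return counts.most_common(1)[0][0] if counts else default

def majority_classify(model: str, rows: List[Row]) -> List[str]:
    return [model for _ in rows]

toy_rows = [["e", "x"]] * 15 + [["p", "f"]] * 5
rates = cross_validate_sketch(toy_rows, 5,
                              partial(majority_train, default="e"),
                              majority_classify)
mean = sum(rates) / len(rates)
variance = sum((rate - mean) ** 2 for rate in rates) / len(rates)
for fold, rate in enumerate(rates, start=1):
    print(f"Fold {fold}: Error Rate = {rate * 100:.2f}%")
print(f"Average Error Rate: {mean * 100:.2f}%")
```

Because `train_fn` and `classify_fn` are passed in, options such as a smoothing flag can be fixed beforehand with `partial`, and the same loop works for any model with the same function shapes.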

To summarize...

Apply the Naive Bayes Classifier algorithm to the Mushroom data set using 10 fold cross validation and the error rate as the evaluation metric. You will do this *twice*. Once with smoothing=True and once with smoothing=False. You should follow up with a brief explanation for the similarities or differences in the results.

## load_data documentation

load_data(file_path: str) -> List[List[str]]  
    Reads a CSV file containing the mushroom data, dropping rows with any missing data, and returns a list of rows where each row is a list of feature values.  

    Parameters:  
    - file_path (str): The path to the CSV file.  

    Returns:
    - List[List[str]]: A list of lists, each inner list representing a row of feature values.  


In [1]:
import csv
from typing import List, Tuple, Dict

def load_data(file_path: str) -> List[List[str]]:
    data = []
    with open(file_path, 'r') as file:
        reader = csv.reader(file)
        for row in reader:
            if "?" not in row:  
                data.append(row)
    return data

file_path = 'agaricus-lepiota.data'
data = load_data(file_path)


In [2]:
# Test 1: Data should be a list of lists
data = load_data(file_path)
assert isinstance(data, list), "Data should be a list"
assert all(isinstance(row, list) for row in data), "Each row should be a list"
print("load_data test 1 passed!")

# Test 2: Data should contain no missing values
assert all("?" not in row for row in data), "Data should not contain missing values"
print("load_data test 2 passed!")

# Test 3: Data should not be empty
assert len(data) > 0, "Data should not be empty"
print("load_data test 3 passed!")


load_data test 1 passed!
load_data test 2 passed!
load_data test 3 passed!


In [3]:
# IGNORE THIS CELL
# IGNORE THIS CELL
# FUNCTION REPEATED BELOW

from collections import defaultdict, Counter
from typing import NamedTuple, Callable
import numpy as np

class NaiveBayesClassifier(NamedTuple):
    class_priors: Dict[str, float]
    conditional_probs: Dict[str, Dict[int, Dict[str, float]]]


## train documentation  

train(data: List[List[str]], smoothing: bool = True) -> NaiveBayesClassifier  
    Trains a Naive Bayes Classifier on the given data, calculating prior and conditional probabilities for each class and feature value. Supports optional Laplace smoothing.  

    Parameters:  
    - data (List[List[str]]): The dataset with each row as a list of feature values. The first element in each row is the class label.  
    - smoothing (bool): If True, applies Laplace smoothing to conditional probabilities.  

    Returns:  
    - NaiveBayesClassifier: A named tuple containing class priors and conditional probabilities.  


In [4]:

def train(data: List[List[str]], smoothing: bool = True):
    class_counts = Counter(row[0] for row in data)
    total_count = len(data)
    
    class_priors = {cls: count / total_count for cls, count in class_counts.items()}
    
    conditional_probs = {cls: defaultdict(lambda: defaultdict(float)) for cls in class_counts}
    
    for cls in class_counts:
        # Filter data by class
        class_data = np.array([row for row in data if row[0] == cls])
        
        for feature_index in range(1, len(data[0])):
            feature_values, counts = np.unique(class_data[:, feature_index], return_counts=True)
            total_feature_count = counts.sum()
            unique_values = len(feature_values)  # Number of unique feature values
            
            for value, count in zip(feature_values, counts):
                if smoothing:
                    conditional_probs[cls][feature_index][value] = (count + 1) / (total_feature_count + unique_values)
                else:
                    conditional_probs[cls][feature_index][value] = count / total_feature_count

    return NaiveBayesClassifier(class_priors, conditional_probs)


In [5]:
# Test 1: Check if output is NaiveBayesClassifier
nbc = train(data, smoothing=True)
assert isinstance(nbc, NaiveBayesClassifier), "Output should be a NaiveBayesClassifier"
print("train test 1 passed!")

# Test 2: Class priors should include keys for "e" and "p"
assert "e" in nbc.class_priors and "p" in nbc.class_priors, "Class priors should include 'e' and 'p'"
print("train test 2 passed!")

# Test 3: Conditional probabilities should be dictionaries of dictionaries
assert isinstance(nbc.conditional_probs["e"], dict) and isinstance(nbc.conditional_probs["p"], dict), "Conditional probabilities should be a dictionary of dictionaries"
print("train test 3 passed!")


train test 1 passed!
train test 2 passed!
train test 3 passed!


## NaiveBayesClassifier documentation  

NaiveBayesClassifier  
    A named tuple that stores the trained Naive Bayes Classifier, containing class prior probabilities and conditional probabilities for each feature and class.  

    Attributes:  
    - class_priors (Dict[str, float]): A dictionary with classes as keys and their respective prior probabilities as values.  
    - conditional_probs (Dict[str, Dict[int, Dict[str, float]]]): A nested dictionary structure containing conditional probabilities.  
        - Outer Dict[str]: Keys are class labels (e.g., "e" or "p" for edible or poisonous).  
        - Inner Dict[int]: Keys are feature indices (1-indexed) corresponding to each feature.  
        - Innermost Dict[str, float]: Keys are feature values with their respective conditional probabilities given the class.  
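
To make the nesting concrete, here is a toy lookup. `ToyClassifier` is a stand-in with the same two fields as the named tuple documented above, and the probabilities are made up rather than taken from the mushroom data.

```python
from typing import Dict, NamedTuple

class ToyClassifier(NamedTuple):
    """Stand-in with the same two fields as the documented named tuple."""
    class_priors: Dict[str, float]
    conditional_probs: Dict[str, Dict[int, Dict[str, float]]]

toy_nbc = ToyClassifier(
    class_priors={"e": 0.5, "p": 0.5},
    conditional_probs={
        "e": {1: {"x": 0.75, "f": 0.25}},
        "p": {1: {"x": 0.5, "f": 0.5}},
    },
)

# class label -> feature index -> feature value -> P(value | class)
p_x_given_e = toy_nbc.conditional_probs["e"][1]["x"]
# Unnormalized score for class "e" when feature 1 has the value "x".
numerator_e = toy_nbc.class_priors["e"] * p_x_given_e
print(numerator_e)
```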


In [6]:
from collections import defaultdict, Counter
from typing import NamedTuple, Callable
import numpy as np

class NaiveBayesClassifier(NamedTuple):
    class_priors: Dict[str, float]
    conditional_probs: Dict[str, Dict[int, Dict[str, float]]]


In [7]:
# Test 1: Check if NaiveBayesClassifier has required attributes
nbc = train(data, smoothing=True)
assert hasattr(nbc, "class_priors"), "NaiveBayesClassifier should have a 'class_priors' attribute"
assert hasattr(nbc, "conditional_probs"), "NaiveBayesClassifier should have a 'conditional_probs' attribute"
print("NaiveBayesClassifier test 1 passed!")

# Test 2: class_priors should contain probabilities for each class ("e" and "p")
assert "e" in nbc.class_priors and "p" in nbc.class_priors, "class_priors should include 'e' and 'p'"
assert 0 <= nbc.class_priors["e"] <= 1, "'e' class prior should be between 0 and 1"
assert 0 <= nbc.class_priors["p"] <= 1, "'p' class prior should be between 0 and 1"
print("NaiveBayesClassifier test 2 passed!")

# Test 3: conditional_probs should have feature-wise probabilities
sample_feature_index = 1  # check first feature
sample_class = "e"
assert isinstance(nbc.conditional_probs[sample_class][sample_feature_index], dict), "conditional_probs should contain dictionaries of probabilities"
assert all(0 <= prob <= 1 for prob in nbc.conditional_probs[sample_class][sample_feature_index].values()), "All conditional probabilities should be between 0 and 1"
print("NaiveBayesClassifier test 3 passed!")


NaiveBayesClassifier test 1 passed!
NaiveBayesClassifier test 2 passed!
NaiveBayesClassifier test 3 passed!


## classify documentation  

classify(nbc: NaiveBayesClassifier, observations: List[List[str]], labeled: bool = True) -> List[Tuple[str, Dict[str, float]]]  
    Classifies a list of observations using the trained Naive Bayes Classifier, calculating and normalizing probabilities for each class.  

    Parameters:  
    - nbc (NaiveBayesClassifier): A trained Naive Bayes Classifier with calculated priors and conditional probabilities.  
    - observations (List[List[str]]): A list of observations to classify, where each observation is a list of feature values.  
    - labeled (bool): If True, indicates that each observation contains a label at the first index, which will be skipped in classification.  

    Returns:  
    - List[Tuple[str, Dict[str, float]]]: A list of tuples, each containing:  
        - The class with the highest probability.  
        - A dictionary of normalized probabilities for each class.  


In [8]:

def classify(nbc: NaiveBayesClassifier, observations: List[List[str]], labeled: bool = True) -> List[Tuple[str, Dict[str, float]]]:
    results = []
    for observation in observations:
        if labeled:
            observation = observation[1:]  

        class_probs = {}
        for cls in nbc.class_priors:
            prob = nbc.class_priors[cls]  
            
            for feature_index, feature_value in enumerate(observation, start=1):
                prob *= nbc.conditional_probs[cls][feature_index].get(feature_value, 1e-6)
            class_probs[cls] = prob

        total_prob = np.sum(list(class_probs.values()))
        normalized_probs = {cls: prob / total_prob for cls, prob in class_probs.items()}
        
        best_class = max(normalized_probs, key=normalized_probs.get)
        results.append((best_class, normalized_probs))
    
    return results



In [9]:
# Test 1: classify should return a list of tuples
sample_observations = data[:5]
classify_results = classify(nbc, sample_observations, labeled=True)
assert isinstance(classify_results, list), "Classify result should be a list"
assert all(isinstance(result, tuple) for result in classify_results), "Each result should be a tuple"
print("classify test 1 passed!")

# Test 2: Each tuple should contain a best class (str) and probability dict
assert all(isinstance(result[0], str) for result in classify_results), "Each best class should be a string"
assert all(isinstance(result[1], dict) for result in classify_results), "Each probability should be a dict"
print("classify test 2 passed!")

# Test 3: Probability dictionary should sum to 1 (or close due to rounding)
for _, prob_dict in classify_results:
    assert np.isclose(sum(prob_dict.values()), 1.0), "Probabilities should sum to 1"
print("classify test 3 passed!")


classify test 1 passed!
classify test 2 passed!
classify test 3 passed!


## evaluate documentation  

evaluate(predictions: List[Tuple[str, Dict[str, float]]], true_labels: List[str]) -> float  
    Computes the error rate by comparing predicted labels with true labels in a labeled dataset.  

    Parameters:  
    - predictions (List[Tuple[str, Dict[str, float]]]): A list of tuples containing the predicted class and class probabilities for each observation.  
    - true_labels (List[str]): The true class labels for each observation.  
 
    Returns:  
    - float: The classification error rate as a percentage.  


In [10]:

def evaluate(predictions: List[Tuple[str, Dict[str, float]]], true_labels: List[str]) -> float:
    errors = np.sum([pred[0] != true for pred, true in zip(predictions, true_labels)])
    error_rate = errors / len(true_labels)
    return error_rate * 100  



In [11]:
# Test 1: Evaluate should return a float
true_labels = [row[0] for row in sample_observations]
error_rate = evaluate(classify_results, true_labels)
assert isinstance(error_rate, float), "Error rate should be a float"
print("evaluate test 1 passed!")

# Test 2: Error rate should be between 0 and 100
assert 0 <= error_rate <= 100, "Error rate should be between 0 and 100"
print("evaluate test 2 passed!")

# Test 3: perfect classification should yield 0% error rate
perfect_predictions = [(label, {"e": 1.0 if label == "e" else 0.0, "p": 1.0 if label == "p" else 0.0}) for label in true_labels]
assert evaluate(perfect_predictions, true_labels) == 0, "Perfect classification should have 0% error rate"
print("evaluate test 3 passed!")


evaluate test 1 passed!
evaluate test 2 passed!
evaluate test 3 passed!


## cross_validate documentation  

cross_validate(data: List[List[str]], folds: int = 10, smoothing: bool = True) -> None  
    Performs k-fold cross-validation (10 folds by default) on the dataset, training and testing the Naive Bayes Classifier on each fold. Prints the error rate for each fold, the average error rate, and the variance.  

    Parameters:  
    - data (List[List[str]]): The dataset with each row as a list of feature values. The first element in each row is the class label.  
    - folds (int): The number of cross-validation folds (default is 10).  
    - smoothing (bool): If True, applies Laplace smoothing to conditional probabilities during training.  

    Returns:  
    - None  


In [12]:

def cross_validate(data: List[List[str]], folds: int = 10, smoothing: bool = True) -> None:
    data_np = np.array(data)
    np.random.shuffle(data_np)
    fold_size = len(data_np) // folds
    error_rates = []
    
    for fold in range(folds):
        test_data = data_np[fold * fold_size : (fold + 1) * fold_size]
        train_data = np.concatenate((data_np[:fold * fold_size], data_np[(fold + 1) * fold_size:]))
        
        nbc = train(train_data.tolist(), smoothing=smoothing)
        
        true_labels = test_data[:, 0]
        predictions = classify(nbc, test_data.tolist(), labeled=True)
        
        error_rate = evaluate(predictions, true_labels.tolist())
        error_rates.append(error_rate)
        
        print(f"Fold {fold + 1}: Error Rate = {error_rate:.2f}%")
    
    avg_error_rate = np.mean(error_rates)
    error_variance = np.var(error_rates)
    print(f"\nAverage Error Rate: {avg_error_rate:.2f}%")
    print(f"Variance in Error Rate: {error_variance:.4f}")



In [13]:
# Test 1: Cross-validation runs without error
print("cross_validate test 1: Cross-validation execution check")
cross_validate(data, folds=10, smoothing=True)
print("cross_validate test 1 passed!")

# Test 2: Average error rate should be between 0 and 100
error_rates = []
for fold in range(10):
    test_data = data[fold * (len(data) // 10): (fold + 1) * (len(data) // 10)]
    train_data = data[:fold * (len(data) // 10)] + data[(fold + 1) * (len(data) // 10):]
    nbc = train(train_data, smoothing=True)
    predictions = classify(nbc, test_data, labeled=True)
    true_labels = [row[0] for row in test_data]
    error_rates.append(evaluate(predictions, true_labels))

average_error_rate = np.mean(error_rates)
assert 0 <= average_error_rate <= 100, "Average error rate should be between 0 and 100"
print("cross_validate test 2 passed!")

# Test 3: Variance of error rates should be non-negative
variance_error_rate = np.var(error_rates)
assert variance_error_rate >= 0, "Variance should be non-negative"
print("cross_validate test 3 passed!")


cross_validate test 1: Cross-validation execution check
Fold 1: Error Rate = 0.53%
Fold 2: Error Rate = 0.00%
Fold 3: Error Rate = 0.35%
Fold 4: Error Rate = 0.00%
Fold 5: Error Rate = 0.18%
Fold 6: Error Rate = 0.00%
Fold 7: Error Rate = 0.53%
Fold 8: Error Rate = 1.06%
Fold 9: Error Rate = 0.00%
Fold 10: Error Rate = 0.18%

Average Error Rate: 0.28%
Variance in Error Rate: 0.1081
cross_validate test 1 passed!
cross_validate test 2 passed!
cross_validate test 3 passed!


In [14]:
# Perform cross-validation with smoothing=True 
print("Cross-Validation with Smoothing:")
cross_validate(data, smoothing=True)



Cross-Validation with Smoothing:
Fold 1: Error Rate = 0.53%
Fold 2: Error Rate = 0.18%
Fold 3: Error Rate = 0.00%
Fold 4: Error Rate = 0.53%
Fold 5: Error Rate = 0.00%
Fold 6: Error Rate = 0.18%
Fold 7: Error Rate = 0.18%
Fold 8: Error Rate = 0.18%
Fold 9: Error Rate = 0.71%
Fold 10: Error Rate = 0.35%

Average Error Rate: 0.28%
Variance in Error Rate: 0.0516


In [15]:
# Perform cross-validation with smoothing=False
print("\nCross-Validation without Smoothing:")
cross_validate(data, smoothing=False)



Cross-Validation without Smoothing:
Fold 1: Error Rate = 0.53%
Fold 2: Error Rate = 0.00%
Fold 3: Error Rate = 0.53%
Fold 4: Error Rate = 0.00%
Fold 5: Error Rate = 0.35%
Fold 6: Error Rate = 0.00%
Fold 7: Error Rate = 0.18%
Fold 8: Error Rate = 0.53%
Fold 9: Error Rate = 0.35%
Fold 10: Error Rate = 0.18%

Average Error Rate: 0.27%
Variance in Error Rate: 0.0456


The two runs are nearly indistinguishable: with smoothing the average error rate is 0.28% (variance 0.0516), and without smoothing it is 0.27% (variance 0.0456). Because `cross_validate` reshuffles the data on every call, differences this small are within run-to-run noise; the earlier smoothed run in the test cell also averaged 0.28% but had a variance of 0.1081.

Two aspects of the implementation explain the similarity. First, `train` applies +1 smoothing only to feature values that were actually observed within a class, so a value that never co-occurs with a class gets no smoothed probability. Second, `classify` substitutes a fixed probability of `1e-6` for any value missing from the table, in both modes, so the unsmoothed model never multiplies by an exact zero. With thousands of training rows per fold, adding 1 to an observed count barely changes the estimates.

All in all, smoothing makes no practical difference to the error rate on this data set with this implementation. Its usual benefit, avoiding zero probabilities for feature values unseen in a class, is already covered here by the `1e-6` fallback.

## Before You Submit...

1. Did you provide output exactly as requested?
2. Did you re-execute the entire notebook? ("Restart Kernel and Run All Cells...")
3. If you did not complete the assignment or had difficulty please explain what gave you the most difficulty in the Markdown cell below.
4. Did you change the name of the file to `jhed_id.ipynb`?

Do not submit any other files.