# Module 9 - Programming Assignment

## Directions

1. Change the name of this file to be your JHED id as in `jsmith299.ipynb`. Be sure you use your JHED ID (it's made out of your name, unlike your student id, which is just letters and numbers).
2. Make sure the notebook you submit is cleanly and fully executed. I do not grade unexecuted notebooks.
3. Submit your notebook back in Blackboard where you downloaded this file.

*Provide the output **exactly** as requested*

## Naive Bayes Classifier

For this assignment you will be implementing and evaluating a Naive Bayes Classifier with the same data from last week:

http://archive.ics.uci.edu/ml/datasets/Mushroom

(You should have downloaded it).

<div style="background: lemonchiffon; margin:20px; padding: 20px;">
    <strong>Important</strong>
    <p>
        No Pandas. The only acceptable libraries in this class are those contained in the `environment.yml`. No OOP, either. You can use Dicts, NamedTuples, etc. as your abstract data type (ADT) for the classifier.
    </p>
</div>


You'll first need to calculate all of the necessary probabilities using a `train` function. A flag will control whether or not you use "+1 Smoothing". You'll then need a `classify` function that takes your probabilities and a List of instances (possibly a list of 1) and returns a List of Tuples. Each Tuple has the best class in the first position and, in the second position, a dict with a key for every possible class label and the associated *normalized* probability. For example, if we gave the `classify` function a list of 2 observations, we would get the following back:

```
[("e", {"e": 0.98, "p": 0.02}), ("p", {"e": 0.34, "p": 0.66})]
```

When calculating the error rate of your classifier, you should pick the class label with the highest probability; you can write a simple function that takes the Dict and returns that class label.
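
A minimal sketch of such a helper, assuming the distribution is a plain dict mapping each class label to its normalized probability. The name `best_label` is hypothetical and not part of the assignment:

```python
from typing import Dict


def best_label(distribution: Dict[str, float]) -> str:
    # Return the key whose probability is largest (the argmax over labels).
    return max(distribution, key=distribution.get)


print(best_label({"e": 0.34, "p": 0.66}))  # p
```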

As a reminder, the Naive Bayes Classifier generates the *unnormalized* probabilities from the numerator of Bayes Rule:

$$P(C|A) \propto P(A|C)P(C)$$

where C is the class and A are the attributes (data). Since the normalizer of Bayes Rule is the *sum* of all possible numerators and you have to calculate them all, the normalizer is just the sum of the probabilities.
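
The normalization step can be sketched as follows. This is an illustration only: the helper name `normalize` is hypothetical, and the uniform fallback for an all-zero total is an assumption (it can only arise without smoothing), not something the assignment specifies.

```python
from typing import Dict


def normalize(scores: Dict[str, float]) -> Dict[str, float]:
    # The normalizer is the sum of the unnormalized numerators P(A|C)P(C).
    total = sum(scores.values())
    if total == 0:
        # Assumption: if every class scored zero, fall back to a uniform distribution.
        return {label: 1 / len(scores) for label in scores}
    return {label: score / total for label, score in scores.items()}


# Unnormalized numerators for two classes.
print(normalize({"e": 3.0, "p": 1.0}))  # {'e': 0.75, 'p': 0.25}
```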

You will have the same basic functions as the last module's assignment and some of them can be reused or at least repurposed.

`train` takes training_data and returns a Naive Bayes Classifier (NBC) as a data structure. There are many options including namedtuples and just plain old nested dictionaries. **No OOP**.

```
def train(training_data, smoothing=True):
    # returns the Naive Bayes Classifier (NBC) as a data structure.
    ...
```

The `smoothing` value defaults to True. You should handle both cases.

`classify` takes a NBC produced from the function above and applies it to labeled data (like the test set) or unlabeled data (like some new data). (This is not the same `classify` as the pseudocode which classifies only one instance at a time; it can call it though).

```
def classify(nbc, observations, labeled=True):
    # returns a list of tuples, the argmax and the raw data as per the pseudocode.
```

`evaluate` takes a data set with labels (like the training set or test set) and the classification result and calculates the classification error rate:

$$error\_rate=\frac{errors}{n}$$

Do not use anything else as the evaluation metric or the submission will be deemed incomplete, i.e., an "F". (Hint: accuracy rate is not the error rate!).

`cross_validate` takes the data and uses 10 fold cross validation (from Module 3!) to `train`, `classify`, and `evaluate`. **Remember to shuffle your data before you create your folds**. I leave the exact signature of `cross_validate` to you but you should write it so that you can use it with *any* `classify` function of the same form (using higher order functions and partial application). If you did so last time, you can reuse it for this assignment.
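
One way to keep the fold loop independent of a particular classifier is to pass the `train`, `classify`, and `evaluate` functions in as arguments and bind options such as `smoothing` with `functools.partial`. The sketch below is hypothetical: `run_folds` and the toy majority-class functions are stand-ins invented for illustration, not the assignment's required signatures.

```python
from collections import Counter
from functools import partial
from typing import Callable, List


def run_folds(folds: List[List[list]], train_fn: Callable,
              classify_fn: Callable, evaluate_fn: Callable) -> List[float]:
    rates = []
    for i, test in enumerate(folds):
        # Every fold except fold i becomes training data.
        training = [row for j, fold in enumerate(folds) if j != i for row in fold]
        model = train_fn(training)
        predictions = classify_fn(model, test)
        actual = [row[-1] for row in test]
        rates.append(evaluate_fn(actual, predictions))
    return rates


def majority_train(rows: List[list], smoothing: bool = True) -> str:
    # Toy "model": the most common label. The smoothing flag is accepted only
    # to mirror the shape of the assignment's train function; it is unused here.
    return Counter(row[-1] for row in rows).most_common(1)[0][0]


def majority_classify(model: str, rows: List[list]) -> List[str]:
    # Predict the stored majority label for every row.
    return [model for _ in rows]


def error_rate(actual: List[str], predicted: List[str]) -> float:
    return sum(a != p for a, p in zip(actual, predicted)) / len(actual)


folds = [[["x", "e"], ["y", "e"], ["x", "p"]],
         [["y", "p"], ["x", "p"], ["y", "e"]]]
# partial fixes the flag so run_folds only ever sees a one-argument train function.
rates = run_folds(folds, partial(majority_train, smoothing=False),
                  majority_classify, error_rate)
print(rates)
```

Because `run_folds` only relies on the call shapes `train_fn(training)` and `classify_fn(model, test)`, the same loop could be reused with any classifier that fits them.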

Following Module 3's discussion, `cross_validate` should print out the fold number and the evaluation metric (error rate) for each fold and then the average value (and the variance). What you are looking for here is a consistent evaluation metric across the folds. You should print the error rates in terms of percents (i.e., multiply the error rate by 100 and add "%" to the end).
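
A small sketch of the per-fold report with the average and variance. The helper name `summarize` and the sample rates are made up for illustration, and the population variance (dividing by the number of folds) is an assumption; Module 3 may prescribe the sample variance instead.

```python
from typing import List, Tuple


def summarize(error_rates: List[float]) -> Tuple[float, float]:
    # Mean and population variance of the per-fold error rates.
    mean = sum(error_rates) / len(error_rates)
    variance = sum((rate - mean) ** 2 for rate in error_rates) / len(error_rates)
    return mean, variance


rates = [0.02, 0.04, 0.03]  # made-up error rates, one per fold
for fold, rate in enumerate(rates, start=1):
    print(f"Fold {fold} Error rate: {rate * 100:.3f}%")
mean, variance = summarize(rates)
print(f"Mean error rate: {mean * 100:.3f}%  Variance: {variance:.6f}")
```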

To summarize...

Apply the Naive Bayes Classifier algorithm to the Mushroom data set using 10 fold cross validation and the error rate as the evaluation metric. You will do this *twice*. Once with smoothing=True and once with smoothing=False. You should follow up with a brief explanation for the similarities or differences in the results.

In [1]:
from copy import deepcopy
from typing import List, Dict, Tuple, Callable
import random
import math
import json

In [2]:
# getting data ready
mushroom_cols = [
    "cap-shape"
    ,"cap-surface"
    ,"cap-color"
    ,"bruises"
    ,"odor"
    ,"gill-attachment"
    ,"gill-spacing"
    ,"gill-size"
    ,"gill-color"
    ,"stalk-shape"
    ,"stalk-root"
    ,"stalk-surface-above-ring"
    ,"stalk-surface-below-ring"
    ,"stalk-color-above-ring"
    ,"stalk-color-below-ring"
    ,"veil-type"
    ,"veil-color"
    ,"ring-number"
    ,"ring-type"
    ,"spore-print-color"
    ,"population"
    ,"habitat"
    ,"edibility"
]

self_check = [
    ['Shape', 'Size', 'Color', 'Safe?'],
    ['round', 'large', 'blue', 'no'],
    ['square', 'large', 'red', 'no'],
    ['round', 'large', 'green', 'yes'],
    ['square', 'large', 'green', 'yes'],
    ['square', 'large', 'green', 'yes'],
    ['square', 'large', 'green', 'yes'],
    ['round', 'large', 'red', 'yes'],
    ['round', 'large', 'red', 'yes'],
    ['round', 'small', 'blue', 'no'],
    ['square', 'small', 'blue', 'no'],
    ['round', 'small', 'green', 'no'],
    ['square', 'small', 'green', 'no'],
    ['square', 'small', 'red', 'no'],
    ['square', 'small', 'red', 'no'],
    ['round', 'small', 'red', 'yes'],
]

<a id="parse_data"></a>
## parse_data

- Reads in a comma separated file into a nested list
- Stores the label column as the very last column
- Function mostly reused from mod 3

* **file_name** str: path to where file is located
* **class_index** int: index of where label field is in the file

**returns** List[List[]]: data stored in a nested list

In [3]:
def parse_data(file_name: str, class_index: int) -> List[List]:
    data = []
    # use a context manager so the file is closed even if parsing fails
    with open(file_name, "r") as file:
        for line in file:
            datum = line.rstrip().split(",")
            data.append(datum)
    random.shuffle(data)
    for row in data:
        # move the label to the last column
        label = row.pop(class_index)
        row.append(label)

    return data

In [4]:
# unit tests
data = parse_data("agaricus-lepiota.data",0)

# verify all observations are present
assert len(data) ==8124 

#verify all attributes and class cols are present
assert len(data[0]) == 23

# verify moved class/label col is last column
for row in data[1:]:
    assert row[0] not in ['p','e'] # first col is cap-shape, doesnt have values e or p
    assert row[-1] in ['p','e'] # label/class only takes e or p value

<a id="create_folds"></a>
## create_folds

- Reused from mod 3
- Creates folds from the data. The number of folds is set by a parameter

* **xs** List[List[]]: data to perform cross validation on
* **n** int: number of folds

**returns** List[List[List]]: the data split into n folds

In [5]:
def create_folds(xs: List[List], n: int) -> List[List[List]]:
    k, m = divmod(len(xs), n)
    # be careful of generators...
    return list(xs[i * k + min(i, m):(i + 1) * k + min(i + 1, m)] for i in range(n))

In [6]:
# make unit tests
folds = create_folds(data, 10)

#verify a list is returned
assert type(folds) == list

# no other unit tests since this was used in mod 3. 
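
As a sanity check on the slicing arithmetic in `create_folds`, the fold sizes it produces can be computed directly from `divmod`. The helper `fold_sizes` below is hypothetical and only mirrors the index math; it does not touch the data.

```python
from typing import List


def fold_sizes(n_items: int, n_folds: int) -> List[int]:
    # k is the base fold size; the first m folds each take one extra item.
    k, m = divmod(n_items, n_folds)
    return [k + 1 if i < m else k for i in range(n_folds)]


print(fold_sizes(8124, 10))
# [813, 813, 813, 813, 812, 812, 812, 812, 812, 812]
```

For the 8124 mushroom observations this gives four folds of 813 and six of 812, so fold sizes never differ by more than one.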

<a id="create_train_test"></a>
## create_train_test

- Mostly reused function from Mod 3
- Creates training and test data based on folds
- For both the training and test sets, the column names are inserted as the first row (index 0) of the list

* **folds**: List[List[List]]: data to split
* **index** : index of fold for splitting
* **cols_names**: list of column names

**returns** Tuple[List[List], List[List]]: returns training and test data

In [7]:
def create_train_test(folds: List[List[List]], index: int, cols_names:List[str]) -> Tuple[List[List], List[List]]:
    training = []
    test = []
    for i, fold in enumerate(folds):
        if i == index:
            test = fold
        else:
            training = training + fold
    # add column names
    training.insert(0,cols_names)
    test_copy = deepcopy(test)
    test_copy.insert(0,cols_names)
    return training, test_copy

In [8]:
train_mush, test_mush = create_train_test(folds, 0,mushroom_cols)

# verify first row is the column names
assert train_mush[0] == test_mush[0] == mushroom_cols

# verify train is 9/10 of data
assert len(train_mush) == math.floor(len(data) * 9/10) +1 # round for the col name


# train and test should be the same size as data
assert len(test_mush) + len(train_mush) == len(data) +2 # 2 col rows

<a id="evaluate"></a>
## evaluate

- returns the error rate of the predictions

* **label** List[str]: list of actual labels for the data points
* **prediction** List[str]: list of prediction labels from the model


**returns** float: returns error rate

In [9]:
def evaluate(label:list[str], prediction:list[str])->float:
    n = len(label)
    false_vals = 0

    for i in range(n):
        if label[i] != prediction[i]:
            false_vals += 1
    return false_vals/n

In [10]:
l_test = [1,1,1,0]
p_test = [0,0,0,0]

# verify error rate is 0 if all match
zero_rate = evaluate(l_test,l_test)
assert zero_rate == 0

#verify error rate doesnt match accuracy
e_test = evaluate(l_test, p_test)
assert e_test != 1/4

# verify error rate is correct
e_test = evaluate(l_test, p_test)
assert e_test == 3/4

# verify accuracy and error rate when half data is correct
p_2 = [1,1,0,1]
e_test2 = evaluate(l_test, p_2)
accuracy = 2/4 # two correct predictions over 4 records
assert e_test2 == accuracy 

<a id="get_label_count"></a>
## get_label_count

- Gets a count of each label in a data set

* **data** List[List[]]: data used for classifying. Data is parsed
* **index** int: position of label column in data


**returns** dict: dictionary of counts per label


In [11]:
def get_label_count(data: List[List], index: int = -1) -> dict:
    label_cnt = {}
    labels = [ rw [index] for rw in data]
    unique_labels = set(labels)

    for label in unique_labels:
        class_list = [ c for c in labels if c == label]
        label_cnt[label] = len(class_list)
    
    return label_cnt


In [12]:
label_cnt_test = get_label_count(self_check[1:],-1)

# verify get label count returns a dictionary of 2 class
assert len(label_cnt_test) == 2

# verify the value is a integer, the count
for k,v in label_cnt_test.items():
    assert isinstance(v, int)

# verify the counts are correct
assert label_cnt_test == {'yes': 7, 'no': 8}

# verify empty data doesn't cause the function to break. An empty dictionary should be returned
cnt_test2 = get_label_count([],1)
assert cnt_test2 == {}


<a id="get_condition_prob"></a>
## get_condition_prob

- Gets the conditional probability of each value for a given feature

* **data** List[List[]]: data used for classifying. Data is parsed
* **col_index** int: position of feature in data
* **label_cnt** dict: count of each label value in all of data
* **smoothing** bool: indicating to use smoothing (+1) logic or not

**returns** dict: conditional probabilities for each value for a given feature. 

In [13]:
def get_condition_prob(data:List[List], col_index:int,label_cnt:dict,smoothing:bool)->dict:
    val_prob = {}
    col_values = [rw[col_index] for rw in data] 
    unique_cols = set(col_values)
    for col_val in unique_cols:
        subset = [rw for rw in data if rw[col_index] == col_val]
        subset_cnt = get_label_count(subset, -1)
        label_prob = {}
        for label in label_cnt:
            if label not in subset_cnt:
                s_cnt = 0
            else:
                s_cnt = subset_cnt[label]
            if smoothing:
                prob = (s_cnt + 1) / (label_cnt[label] + 1)
            else:
                prob = s_cnt /label_cnt[label]
            label_prob[label] = round(prob,3)
        val_prob[col_val] = label_prob
    return val_prob

In [14]:
self_check_shape = get_condition_prob(self_check[1:], 0,label_cnt_test,True)
self_check_shape

# verify result is a dictionary with unique values in column
assert len(self_check_shape) == 2

# verify each col value has 2 entries (the number of labels in data set)
for col in self_check_shape:
    assert len(self_check_shape[col]) == 2

# verify the values are as expected
assert self_check_shape == {'square': {'yes': 0.5, 'no': 0.667}, 'round': {'yes': 0.625, 'no': 0.444}}

#verify function can  handle values that dont have both labels (color=blue)
color = get_condition_prob(self_check[1:], 2,label_cnt_test,True)
assert color["blue"] == {'yes': 0.125, 'no': 0.444}
color2 = get_condition_prob(self_check[1:], 2,label_cnt_test,False)
assert color2["blue"] == {'no': 0.375, 'yes': 0}
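
The effect of the smoothing flag on a single conditional probability can be seen in isolation. This sketch follows the notebook's own variant of "+1 smoothing" (add 1 to both the count and the class total); the helper name `conditional_prob` is hypothetical, and other courses or texts may use a different denominator.

```python
def conditional_prob(value_count: int, class_count: int, smoothing: bool) -> float:
    # value_count: rows of this class that have the feature value
    # class_count: all rows of this class
    if smoothing:
        return (value_count + 1) / (class_count + 1)
    return value_count / class_count


# In self_check, no 'yes' row is blue and there are 7 'yes' rows in total.
print(conditional_prob(0, 7, smoothing=False))  # 0.0
print(conditional_prob(0, 7, smoothing=True))   # 0.125
```

Without smoothing, the zero wipes out the whole product for that class no matter what the other features say; with smoothing, the class keeps a small nonzero score.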

<a id="train"></a>
## train

- Creates a Naive Bayes Classifier model
- What is returned is a dictionary of probabilities used to classify a test point

* **training_data** List[List[]]: data used for classifying. Data is parsed
* **smoothing** bool: indicating to use smoothing (+1) logic or not


**returns** dict: nested dictionary of probabilities

In [15]:
def train(training_data:List[List], smoothing=True)->dict:
    probs = {}
    p_c ={}
    if len(training_data) <= 1:
        return {}
    label_cnt = get_label_count(training_data[1:],-1)
    
    # get p(c)
    for label in label_cnt:
        p_c[label] = round(label_cnt[label] / (len(training_data) - 1),3) # exclude the column names
    probs[training_data[0][-1]] = p_c

    for i in range(len(training_data[0])-1):# loop through all features
        probs[training_data[0][i]] = get_condition_prob(training_data[1:], i,label_cnt,smoothing)
    return probs


In [16]:
self_check_probs = train(self_check, smoothing=True)
self_check_probs2 = train(self_check, False)

# verify the number of keys is number of columns
assert len(self_check_probs) == len(self_check[0])
assert len(self_check_probs2) == len(self_check[0])

#verify label key is not a nested dict, but other cols are nested
for k in self_check_probs:
    for k2 in self_check_probs[k]:
        if k == "Safe?":
            assert isinstance(self_check_probs[k][k2], float)
        else:
            assert isinstance(self_check_probs[k][k2], dict)

#verify function returns empty dictionary if data is empty
empty_check = train([], smoothing=True)
assert empty_check == {}

<a id="get_prob_dist"></a>
## get_prob_dist

- Classifies a single observation using naive bayes classifier
- What is returned is the predicted label with the probability distribution

* **probs** dict: probabilities relating to the test point
* **label** str: name of the label field


**returns** Tuple[str, dict]: predicted label, probability distribution

In [17]:
def get_prob_dist(probs:dict, label:str)->Tuple[str, dict]:
    prob_dist = {}
    label_vals = set(probs[label].keys())
    col_names = set(probs.keys()) - {label}

    for l_val in label_vals:
        prod = probs[label][l_val]
        for col in col_names:
            prod *= probs[col][l_val]
        prob_dist[l_val] = prod
    
    norm = sum(prob_dist.values()) # normalize
    for k in prob_dist:
        prob_dist[k] = round((prob_dist[k] / norm), 3)
    # get arg max
    pred_label = max(prob_dist, key =prob_dist.get)
    
    return (pred_label, prob_dist)

In [18]:
test_probs = {'Safe?': {'yes': 0.467, 'no': 0.533},
 'Shape': {'yes': 0.5, 'no': 0.667},
 'Size':  {'yes': 0.875, 'no': 0.333},
 'Color': {'yes': 0.5, 'no': 0.444}}
test_prob_result = get_prob_dist(test_probs, "Safe?")

#verify tuple is returned
assert isinstance(test_prob_result,tuple)

#verify first val is yes, the predicted label
assert isinstance(test_prob_result[0], str)
assert test_prob_result[0] == "yes"

#verify second item is probability distribution
assert isinstance(test_prob_result[1], dict)

#verify probability distribution is the right size
assert len(test_prob_result[1]) == 2

<a id="classify"></a>
## classify

- Classifies each observation using the naive bayes classifier

* **nbc** dict: probabilities of the naive bayes classifier
* **observations** list[list]: data to make predictions on. Data is parsed
* **labeled** bool: indicator if observations have a label field present or not


**returns** List[Tuple[str, dict]]: list of predicted labels, probability distributions

In [19]:
def classify(nbc:dict, observations:list[list], labeled=True)->list:
    results = []
    if labeled:
        label = observations[0][-1]
        observations = [rw[:-1] for rw in observations]
        col_labels = observations[0]
    else:
        col_labels = observations[0]
        label = [k for k in nbc if k not in col_labels]
        label = label[0]
    for ob in observations[1:]:
        ob_nbc = {}
        ob_nbc[label] = nbc[label] # probs for the observation
        for col_index in range(len(ob)):
            col_key = col_labels[col_index]
            feature_prob = nbc[col_key][ob[col_index]]
            ob_nbc[col_key] = feature_prob
        prob_dist = get_prob_dist(ob_nbc, label)
        results.append(prob_dist) 
    return results

In [26]:
self_check_test_example = [self_check[0]]+ [["square","large","red", "yes"]]
test_classify = classify(self_check_probs, self_check_test_example, True)

#verify a list of tuples is returned
assert isinstance(test_classify, list)
assert isinstance(test_classify[0], tuple)

#verify the result is as expected. Matches prob returned from get_prob_dist
assert test_classify[0] == test_prob_result

# verify the same result returns if label is false
self_check_test_example2 = [['Shape', 'Size', 'Color'], ['square', 'large', 'red']]
test_classify2 = classify(self_check_probs, self_check_test_example2, False)
assert test_classify2[0] == test_prob_result

<a id="cross_validate"></a>
## cross_validate

- Performs cross validation to classify data using a naive Bayes classifier
- First splits the data into 10 folds
- Then trains the model to build the naive Bayes classifier (probabilities)
- Then makes predictions on the test set
- Then evaluates the model
- Prints out the results for each fold

* **data** List[List[]]: data used for classifying. Data is parsed
* **col_names** List[str]: attribute names
* **smoothing** bool: indicating to use smoothing (+1) logic or not
* **labeled** bool: indicator if observations have a label field present or not


In [27]:
def cross_validate(data:List[list],col_names:list[str], smoothing=True, labeled=True):
    folds = create_folds(data, 10)
    for i in range(10):
        train_data, test_data = create_train_test(folds, i,col_names)
        nbc = train(train_data, smoothing)

        preds = classify(nbc, test_data, labeled)
        actual_labels = [row[-1] for row in test_data[1:]]
        pred_labels = [ pred[0] for pred in preds]

        error_rate = evaluate(actual_labels, pred_labels)
        error_rate = error_rate * 100
        print("Fold", i+1, "Error rate:", round(error_rate,3), "%")

In [28]:
print("Mushroom 10 Fold Cross Validation with Smoothing")
cross_validate(data,mushroom_cols, True, True)

Mushroom 10 Fold Cross Validation with Smoothing
Fold 1 Error rate: 0.369 %
Fold 2 Error rate: 0.123 %
Fold 3 Error rate: 0.246 %
Fold 4 Error rate: 0.246 %
Fold 5 Error rate: 0.0 %
Fold 6 Error rate: 0.616 %
Fold 7 Error rate: 0.123 %
Fold 8 Error rate: 0.369 %
Fold 9 Error rate: 0.493 %
Fold 10 Error rate: 0.862 %


In [29]:
print("Mushroom 10 Fold Cross Validation without Smoothing")
cross_validate(data,mushroom_cols, False, True)

Mushroom 10 Fold Cross Validation without Smoothing
Fold 1 Error rate: 0.369 %
Fold 2 Error rate: 0.123 %
Fold 3 Error rate: 0.246 %
Fold 4 Error rate: 0.123 %
Fold 5 Error rate: 0.0 %
Fold 6 Error rate: 0.493 %
Fold 7 Error rate: 0.123 %
Fold 8 Error rate: 0.369 %
Fold 9 Error rate: 0.493 %
Fold 10 Error rate: 0.985 %


## Summary of Results

- Both models, with and without smoothing, had nearly the same results.
- The error rate is very low, but these models did not perform as well as the decision tree models (mod 8).
- The model without smoothing performed better (lower error rate) on two folds (4 and 6).
- The model with smoothing performed better on one fold (10). Otherwise, the error rate was the same between the two models.
- The model without smoothing might have performed slightly better because there were no missing values in the data
    - "Missing values" here means values that appear in the test data but not in the training data
    - For example, the test set having ring_number = 2 while the training set only had ring_number = none and 1
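
To make the "missing value" case concrete: if a test observation did carry a value that never appeared in training, a direct dictionary lookup of its conditional probabilities would raise a `KeyError`. The sketch below shows one possible guard. The helper `lookup_prob` and the `fallback` probability are hypothetical choices for illustration and are not part of the notebook's `classify`.

```python
from typing import Dict, List


def lookup_prob(feature_probs: Dict[str, Dict[str, float]], value: str,
                labels: List[str], fallback: float) -> Dict[str, float]:
    # feature_probs maps a feature value to {label: P(value | label)}.
    if value in feature_probs:
        return feature_probs[value]
    # Assumption: an unseen value gets the same small probability for every label.
    return {label: fallback for label in labels}


ring_number = {"none": {"e": 0.1, "p": 0.2}, "1": {"e": 0.9, "p": 0.8}}
print(lookup_prob(ring_number, "1", ["e", "p"], fallback=0.001))
print(lookup_prob(ring_number, "2", ["e", "p"], fallback=0.001))
```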

## Before You Submit...

1. Did you provide output exactly as requested?
2. Did you re-execute the entire notebook? ("Restart Kernel and Run All Cells...")
3. If you did not complete the assignment or had difficulty please explain what gave you the most difficulty in the Markdown cell below.
4. Did you change the name of the file to `jhed_id.ipynb`?

Do not submit any other files.