# Module 9 - Programming Assignment


## Naive Bayes Classifier

For this assignment you will be implementing and evaluating a Naive Bayes Classifier with the same data from last week:

http://archive.ics.uci.edu/ml/datasets/Mushroom

(You should have downloaded it).

<div style="background: lemonchiffon; margin:20px; padding: 20px;">
    <strong>Important</strong>
    <p>
        No Pandas. The only acceptable libraries in this class are those contained in the `environment.yml`. No OOP, either. You can use Dicts, NamedTuples, etc. as your abstract data type (ADT) for the classifier.
    </p>
</div>


You'll first need to calculate all of the necessary probabilities using a `train` function. A flag will control whether or not you use "+1 Smoothing". You'll then need a `classify` function that takes your probabilities and a List of instances (possibly a list of 1) and returns a List of Tuples. Each Tuple has the best class in the first position and, in the second, a dict with a key for every possible class label and the associated *normalized* probability. For example, if we give the `classify` function a list of 2 observations, we would get the following back:

```
[("e", {"e": 0.98, "p": 0.02}), ("p", {"e": 0.34, "p": 0.66})]
```

When calculating the error rate of your classifier, you should pick the class label with the highest probability; you can write a simple function that takes the Dict and returns that class label.
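One possible shape for such a function is sketched here. The name `best_label` is hypothetical and not part of the assignment; it assumes the Dict maps each class label to its normalized probability.

```python
from typing import Dict

def best_label(probabilities: Dict[str, float]) -> str:
    # hypothetical helper: return the key whose probability is largest
    return max(probabilities, key=probabilities.get)

# the second observation from the example output above
label = best_label({"e": 0.34, "p": 0.66})
print(label)
```

If two labels tie, `max` returns whichever comes first in the Dict, so a deliberate tie-breaking rule may be worth adding.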

As a reminder, the Naive Bayes Classifier generates the *unnormalized* probabilities from the numerator of Bayes Rule:

$$P(C|A) \propto P(A|C)P(C)$$

where C is the class and A are the attributes (data). Since the normalizer of Bayes Rule is the *sum* of all possible numerators and you have to calculate them all, the normalizer is just the sum of the unnormalized probabilities.
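As a small worked example of that normalization, the sketch below uses made-up priors and likelihoods for two classes and two attributes. All of the numbers are illustrative assumptions, not values from the Mushroom data.

```python
import math

# assumed toy values: P(C) and P(a_i | C) for an observation with two attributes
priors = {"e": 0.5, "p": 0.5}
likelihoods = {"e": [0.6, 0.2], "p": [0.1, 0.4]}

# numerator of Bayes Rule for each class: P(A|C) P(C)
numerators = {c: priors[c] * math.prod(likelihoods[c]) for c in priors}

# the normalizer is the sum of all the numerators
normalizer = sum(numerators.values())
posteriors = {c: numerators[c] / normalizer for c in numerators}

print(numerators)   # roughly 0.06 for "e" and 0.02 for "p"
print(posteriors)   # roughly 0.75 for "e" and 0.25 for "p"
```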

You will have the same basic functions as the last module's assignment and some of them can be reused or at least repurposed.

`train` takes training_data and returns a Naive Bayes Classifier (NBC) as a data structure. There are many options including namedtuples and just plain old nested dictionaries. **No OOP**.

```
def train(training_data, smoothing=True):
   # returns the Naive Bayes Classifier (NBC) data structure.
```

The `smoothing` value defaults to True. You should handle both cases.

`classify` takes a NBC produced from the function above and applies it to labeled data (like the test set) or unlabeled data (like some new data). (This is not the same `classify` as the pseudocode which classifies only one instance at a time; it can call it though).

```
def classify(nbc, observations, labeled=True):
    # returns a list of tuples, the argmax and the raw data as per the pseudocode.
```

`evaluate` takes a data set with labels (like the training set or test set) and the classification result and calculates the classification error rate:

$$error\_rate=\frac{errors}{n}$$

Do not use anything else as the evaluation metric or the submission will be deemed incomplete, i.e., an "F". (Hint: accuracy rate is not the error rate!).
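To make the hint concrete, here is a minimal sketch on four made-up labels. The name `error_rate` and the toy lists are assumptions for illustration only; an actual `evaluate` would pull the labels out of the data set and the `classify` result.

```python
from typing import List

def error_rate(actual: List[str], predicted: List[str]) -> float:
    # errors / n, where an error is any position where the labels differ
    errors = sum(1 for a, p in zip(actual, predicted) if a != p)
    return errors / len(actual)

actual = ["e", "p", "e", "p"]
predicted = ["e", "e", "e", "p"]   # one mistake out of four

rate = error_rate(actual, predicted)
print(f"error rate: {rate}, accuracy: {1 - rate}")
```

The two numbers sum to 1, which is why reporting accuracy in place of the error rate is easy to do by accident.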

`cross_validate` takes the data and uses 10 fold cross validation (from Module 3!) to `train`, `classify`, and `evaluate`. **Remember to shuffle your data before you create your folds**. I leave the exact signature of `cross_validate` to you but you should write it so that you can use it with *any* `classify` function of the same form (using higher order functions and partial application). If you did so last time, you can reuse it for this assignment.
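One way to read "higher order functions and partial application" is sketched below. The signature, the `seed` parameter and the toy majority-class stand-ins (`toy_train`, `toy_classify`, `toy_evaluate`) are all assumptions made for illustration; the assignment leaves the real signature up to you.

```python
import random
from functools import partial
from typing import Callable, List

def make_folds(xs: List, n: int) -> List[List]:
    # split xs into n nearly equal consecutive slices
    k, m = divmod(len(xs), n)
    return [xs[i * k + min(i, m):(i + 1) * k + min(i + 1, m)] for i in range(n)]

def cross_validate(data: List[List], train_fn: Callable, classify_fn: Callable,
                   evaluate_fn: Callable, n_folds: int = 10, seed=None) -> List[float]:
    rows = list(data)
    random.Random(seed).shuffle(rows)          # shuffle before creating the folds
    folds = make_folds(rows, n_folds)
    error_rates = []
    for i, test in enumerate(folds):
        # every fold except fold i is training data
        training = [row for j, fold in enumerate(folds) if j != i for row in fold]
        model = train_fn(training)
        results = classify_fn(model, test)
        error_rates.append(evaluate_fn(test, results))
    return error_rates

# toy stand-ins so the sketch runs on its own: always predict the majority class
def toy_train(training, smoothing=True):
    labels = [row[0] for row in training]
    return max(set(labels), key=labels.count)

def toy_classify(model, observations):
    return [(model, {model: 1.0}) for _ in observations]

def toy_evaluate(data, results):
    errors = sum(1 for row, (best, _) in zip(data, results) if row[0] != best)
    return errors / len(data)

toy_data = [["e", "x"]] * 15 + [["p", "y"]] * 5
# partial application fixes the smoothing flag, so cross_validate only sees train_fn(training)
rates = cross_validate(toy_data, partial(toy_train, smoothing=False),
                       toy_classify, toy_evaluate, n_folds=5, seed=0)
print(rates)
```

Because the flag is bound with `partial`, the same `cross_validate` can be called once per smoothing setting without changing its signature.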

Following Module 3's discussion, `cross_validate` should print out the fold number and the evaluation metric (error rate) for each fold and then the average value (and the variance). What you are looking for here is a consistent evaluation metric across the folds. You should print the error rates in terms of percents (i.e., multiply the error rate by 100 and add "%" to the end).
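A possible way to produce that report is sketched here. The function name `report` and the exact wording of the printed lines are assumptions, and the sketch uses the sample variance from the standard library; the population variance (`statistics.pvariance`) may be what Module 3 intends.

```python
from statistics import mean, variance
from typing import List, Tuple

def report(error_rates: List[float]) -> Tuple[float, float]:
    # convert each fold's error rate to a percentage before summarizing
    percents = [rate * 100 for rate in error_rates]
    for fold, percent in enumerate(percents, start=1):
        print(f"Fold {fold}: {percent:.2f}%")
    average = mean(percents)
    spread = variance(percents)   # sample variance of the percentages (an assumption)
    print(f"Mean: {average:.2f}%")
    print(f"Variance: {spread:.2f}")
    return average, spread

# made-up error rates for two folds
average, spread = report([0.0, 0.5])
```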

To summarize...

Apply the Naive Bayes Classifier algorithm to the Mushroom data set using 10 fold cross validation and the error rate as the evaluation metric. You will do this *twice*. Once with smoothing=True and once with smoothing=False. You should follow up with a brief explanation for the similarities or differences in the results.

In [123]:
from copy import deepcopy
from collections import Counter
import math
import random
from typing import List, Dict, Tuple, Callable

In [124]:
def parse_data(file_name: str) -> List[List]:
    data = []
    # the context manager closes the file even if parsing fails
    with open(file_name, "r") as file:
        for line in file:
            data.append(line.rstrip().split(","))
    # shuffle once here so the folds created later are random
    random.shuffle(data)
    return data

In [125]:
data = parse_data("agaricus-lepiota-1.data")

In [126]:
len(data)

8124

In [127]:
def create_folds(xs: List, n: int) -> List[List[List]]:
    k, m = divmod(len(xs), n)
    # be careful of generators...
    return list(xs[i * k + min(i, m):(i + 1) * k + min(i + 1, m)] for i in range(n))

In [128]:
folds = create_folds(data, 10)

In [129]:
len(folds)

10

In [130]:
def create_train_test(folds: List[List[List]], index: int) -> Tuple[List[List], List[List]]:
    training = []
    test = []
    for i, fold in enumerate(folds):
        if i == index:
            test = fold
        else:
            training = training + fold
    return training, test

In [131]:
train, test = create_train_test(folds, 0)

In [132]:
len(train)

7311

In [133]:
len(test)

813

In [134]:
atrib = {'cap-shape':1, 
         'cap-surface':2, 
         'cap-color':3 , 
         'bruises?':4, 
         'odor':5, 
         'gill-attachment':6 , 
         'gill-spacing':7 , 
         'gill-size':8, 
         'gill-color':9, 
         'stalk-shape':10, 
         'stalk-root':11, 
         'stalk-surface-above-ring':12, 
         'stalk-surface-below-ring':13, 
         'stalk-color-above-ring':14, 
         'stalk-color-below-ring':15, 
         'veil-type':16, 
         'veil-color':17, 
         'ring-number':18, 
         'ring-type':19, 
         'spore-print-color':20, 
         'population':21, 
         'habitat':22}

<a id="count"></a>
## count
This function computes probability of each feature category per class. **Used by**: [train](#train)

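* **data List[List[str]]:** The rows of the data set; each row holds the class label in position 0 followed by the feature values
* **smoothing bool:** flag used to apply +1 smoothing to the probabilities if true
* **Returns:** Tuple of the nested probability dict (feature index, then class, then value) and a Counter of class counts; together they form the Naive Bayes Classifier (NBC)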
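The effect of the `smoothing` flag on a single estimate can be sketched as follows. The helper name `estimate` is hypothetical, and the formula (add 1 to both the value count and the class count) is taken from the `count` implementation in this notebook rather than from a general definition of smoothing.

```python
def estimate(value_count: int, class_count: int, smoothing: bool) -> float:
    # P(value | class), estimated from counts
    if smoothing:
        return (value_count + 1) / (class_count + 1)
    return value_count / class_count

# a value seen once among 2 rows of a class
print(estimate(1, 2, smoothing=False))   # 1/2
print(estimate(1, 2, smoothing=True))    # 2/3

# a value never seen with that class: zero without smoothing, 1/3 with it
print(estimate(0, 2, smoothing=False))
print(estimate(0, 2, smoothing=True))
```

The last pair shows why smoothing matters: without it, a single unseen value zeroes out the whole product for that class.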

In [135]:
def count(data: list[list[str]], smoothing: bool) -> Tuple[Dict, Counter]:
    num_features = len(data[0][1:])
    class_counts = Counter([row[0] for row in data])

    probabilities = {}
    for feature in range(1, num_features + 1):
        probabilities[feature] = {}
        values = {row[feature] for row in data} # every value observed for this feature
        for cls in class_counts:
            rows_of_class = [row for row in data if row[0] == cls] # filter rows for current class
            counter = Counter([row[feature] for row in rows_of_class]) # Count occurrences of feature
            # Convert counts to prob
            if smoothing:
                # +1 smoothing: a value never seen with this class still gets 1/(n+1)
                prob = {k: (counter[k]+1)/(class_counts[cls]+1) for k in values}
            else:
                prob = {k: v/class_counts[cls] for k, v in counter.items()}
            probabilities[feature][cls] = prob
    return probabilities, class_counts

In [136]:
data1 = [
    ["p", "x", "y"],
    ["p", "y", "z"],
    ["e", "x", "z"],
    ["e", "y", "y"]]
data2 = [
    ["p", "m", "n"],
    ["e", "n", "o"],
    ["p", "m", "p"],
    ["e", "m", "o"],
    ["e", "n", "n"]]

assert count(data1, False) == ({ 1: {"p": {"x": 0.5, "y": 0.5},"e": {"x": 0.5, "y": 0.5}},
                               2: {"p": {"y": 0.5, "z": 0.5},"e": {"z": 0.5, "y": 0.5}}}, Counter({'p': 2, 'e': 2}))
assert count(data1, True) == ({1: {"p": {"x": 0.6666666666666666, "y": 0.6666666666666666},"e": {"x": 0.6666666666666666, "y": 0.6666666666666666}},
                              2: {"p": {"y": 0.6666666666666666, "z": 0.6666666666666666},"e": {"z": 0.6666666666666666, "y": 0.6666666666666666}}}, Counter({'p': 2, 'e': 2}))
assert count(data2, False) == ({1: {"p": {"m": 1.0},"e": {"n": 0.6666666666666666, "m": 0.3333333333333333}},
                                2: {"p": {"n": 0.5, "p": 0.5},"e": {"o": 0.6666666666666666, "n": 0.3333333333333333}}}, Counter({'p': 2, 'e': 3}))

<a id="train"></a>
## train
* **training_data: List[List[str]]:** This is the list of data from which to compute Bayesian probabilities. **Used by**: [none](#none) 
* **smoothing (Optional) bool:**  A boolean parameter which determines whether or not to apply smoothing. By default, it's set to True.

* **Return Tuple[Dict,Dict]:** two dicts composed of the probabilities and class counts

In [137]:
def train(training_data, smoothing=True) -> Tuple[Dict,Dict]:
    '''
    takes training_data and returns a Naive Bayes Classifier (NBC) as a data structure. 
    There are many options including namedtuples and just plain old nested dictionaries
    '''
    return count(training_data, smoothing)

<a id="probability_of"></a>
## probability_of

Computes the unnormalized Bayes probability of one instance for a given class.

* **instance: List[str]:** The instance of data for which to compute the Bayesian probability. **Used by**: [nbc](#nbc) 
* **label: str:** class label to score
* **value: int:** count of training instances with that class label
* **probs: NBC:** probabilities of trained data

* **Return float:** un-normalized probability of current class instance
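The lookup this function performs can be traced by hand on a tiny classifier. The sketch below assumes the `(probabilities, class_counts)` tuple layout returned by `count`; the `probs` value is typed in by hand to match the unsmoothed `data1` test case rather than produced by calling `count`.

```python
import math
from collections import Counter

# hand-built stand-in for a trained NBC: (probabilities, class_counts)
probs = ({1: {"p": {"x": 0.5, "y": 0.5}, "e": {"x": 0.5, "y": 0.5}},
          2: {"p": {"y": 0.5, "z": 0.5}, "e": {"z": 0.5, "y": 0.5}}},
         Counter({"p": 2, "e": 2}))

instance = ["p", "x", "y"]   # position 0 is the label, the rest are feature values
label = "e"                  # the class being scored

# one conditional probability per feature, looked up by feature index, class, value
likelihoods = [probs[0][idx][label][value]
               for idx, value in enumerate(instance[1:], start=1)]

# class prior: count of this class over the total number of training rows
prior = probs[1][label] / sum(probs[1].values())

score = prior * math.prod(likelihoods)   # 0.5 * 0.5 * 0.5
print(score)
```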

In [138]:
def probability_of(instance, label, value, probs):
    result = [] # one factor per feature, plus the class prior
    for idx, category in enumerate(instance[1:], start=1):
        # a category never seen with this label has no entry; treat it as probability 0
        result.append(probs[0][idx][label].get(category, 0.0))
    result.append(value/sum(probs[1].values())) # add class prob
    return math.prod(result)

<a id="normalize"></a>
## normalize
* **results dict:** The unnormalized bayesian probabilities of each class. **Used by**: [nbc](#nbc) 
* **Returns dict:** A new dictionary with the same keys as the input dictionary, but the values are normalized such that they all sum up to 1.

In [139]:
def normalize(results: Dict):
    total_sum = sum(results.values())
    return {key: value / total_sum for key, value in results.items()}

<a id="nbc"></a>
## nbc
* **probs Tuple[Dict,Dict]:** The trained NBC: the conditional probabilities and the class counts returned by `train`. **Used by**: [classify](#classify) 
* **instance: List[str]:** one instance of data
* **Returns Tuple:** the best class label and the resulting normalized probabilities 

In [140]:
def nbc(probs, instance):
    results = {}
    for label, value in probs[1].items(): # from counter dict
        results[label] = probability_of(instance, label, value, probs) # label: value = (K:v)
    results = normalize(results)
    best =  max(zip(results.values(), results.keys()))[1]# essentially argmax
    return (best, results)

In [141]:
# returns a list of tuples, the argmax and the raw data as per the pseudocode.
def classify(nbc_, observations, labeled=True):
    result =[]
    for observation in observations:
        result.append(nbc(nbc_, observation))
    return result

In [142]:
def evaluate(data, results):
    '''
    takes a data set with labels (like the training set or test set) 
    and the classification result and calculates the classification error rate:
    '''
    # an error is a row whose label (position 0) differs from the predicted best class
    errors = sum(1 for row, (best, _) in zip(data, results) if row[0] != best)
    return errors / len(data)

## Before You Submit...

1. Did you provide output exactly as requested?
2. Did you re-execute the entire notebook? ("Restart Kernel and Run All Cells...")
3. If you did not complete the assignment or had difficulty please explain what gave you the most difficulty in the Markdown cell below.
4. Did you change the name of the file to `jhed_id.ipynb`?

Do not submit any other files.