# Module 9 - Programming Assignment

## Directions

1. Change the name of this file to be your JHED id as in `jsmith299.ipynb`. Be sure you use your JHED ID (it's made out of your name and not your student id which is just letters and numbers).
2. Make sure the notebook you submit is cleanly and fully executed. I do not grade unexecuted notebooks.
3. Submit your notebook back in Blackboard where you downloaded this file.

*Provide the output **exactly** as requested*

In [1]:
from copy import deepcopy
import random
from math import log2
from typing import List, Dict, Tuple, Callable

## Naive Bayes Classifier

For this assignment you will be implementing and evaluating a Naive Bayes Classifier with the same data from last week:

http://archive.ics.uci.edu/ml/datasets/Mushroom

(You should have downloaded it).

<div style="background: lemonchiffon; margin:20px; padding: 20px;">
    <strong>Important</strong>
    <p>
        No Pandas. The only acceptable libraries in this class are those contained in the `environment.yml`. No OOP, either. You can use Dicts, NamedTuples, etc. as your abstract data type (ADT) for the classifier.
    </p>
</div>


You'll first need to calculate all of the necessary probabilities using a `train` function. A flag will control whether you use "+1 Smoothing". You'll then need a `classify` function that takes your probabilities and a List of instances (possibly a list of 1) and returns a List of Tuples. Each Tuple has the best class in the first position and, in the second, a dict with a key for every possible class label and the associated *normalized* probability. For example, if we give the `classify` function a list of 2 observations, we would get the following back:

```
[("e", {"e": 0.98, "p": 0.02}), ("p", {"e": 0.34, "p": 0.66})]
```

When calculating the error rate of your classifier, you should pick the class label with the highest probability; you can write a simple function that takes the Dict and returns that class label.
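A minimal sketch of such a helper, assuming the probabilities arrive as a plain dict keyed by class label. The name `best_label` is hypothetical and not part of the required interface.

```python
from typing import Dict

def best_label(probs: Dict[str, float]) -> str:
    """Return the class label with the highest probability."""
    # max over the keys, ranked by their associated probability
    return max(probs, key=probs.get)

print(best_label({"e": 0.34, "p": 0.66}))  # p
```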

As a reminder, the Naive Bayes Classifier generates the *unnormalized* probabilities from the numerator of Bayes Rule:

$$P(C|A) \propto P(A|C)P(C)$$

where C is the class and A are the attributes (data). Since the normalizer of Bayes Rule is the *sum* of all possible numerators and you have to calculate them all, the normalizer is just the sum of the probabilities.
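The normalization step can be sketched as follows. The function name `normalize` and the numerator values are made up for illustration; the zero-total guard is an added assumption for the case where every class scores 0.

```python
from typing import Dict

def normalize(unnormalized: Dict[str, float]) -> Dict[str, float]:
    """Divide each class's numerator by the sum of all numerators."""
    total = sum(unnormalized.values())
    if total == 0:
        # every numerator is 0, so there is nothing to normalize
        return dict(unnormalized)
    return {label: value / total for label, value in unnormalized.items()}

# illustrative numerators P(A|C)P(C) for two classes (made-up values)
numerators = {"e": 0.03, "p": 0.01}
print(normalize(numerators))
```

After normalization the values sum to 1 (up to floating point error), which is what lets them be read as probabilities.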

You will have the same basic functions as the last module's assignment and some of them can be reused or at least repurposed.

`train` takes training_data and returns a Naive Bayes Classifier (NBC) as a data structure. There are many options including namedtuples and just plain old nested dictionaries. **No OOP**.

```
def train(training_data, smoothing=True):
    # returns the Naive Bayes Classifier (NBC) data structure.
```

The `smoothing` value defaults to True. You should handle both cases.

`classify` takes a NBC produced from the function above and applies it to labeled data (like the test set) or unlabeled data (like some new data). (This is not the same `classify` as the pseudocode which classifies only one instance at a time; it can call it though).

```
def classify(nbc, observations, labeled=True):
    # returns a list of tuples, the argmax and the raw data as per the pseudocode.
```

`evaluate` takes a data set with labels (like the training set or test set) and the classification result and calculates the classification error rate:

$$error\_rate=\frac{errors}{n}$$

Do not use anything else as the evaluation metric or the submission will be deemed incomplete, i.e., an "F". (Hint: accuracy rate is not the error rate!).
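To make the distinction concrete, here is a small sketch on made-up labels. The name `error_rate` and the two lists are illustrative only; the accuracy rate would be one minus the value computed here.

```python
from typing import List

def error_rate(actual: List[str], predicted: List[str]) -> float:
    """Fraction of predictions that disagree with the actual labels."""
    errors = sum(1 for a, p in zip(actual, predicted) if a != p)
    return errors / len(actual)

actual = ["e", "p", "e", "p"]
predicted = ["e", "p", "p", "p"]
print(error_rate(actual, predicted))  # 1 error out of 4 -> 0.25
```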

`cross_validate` takes the data and uses 10 fold cross validation (from Module 3!) to `train`, `classify`, and `evaluate`. **Remember to shuffle your data before you create your folds**. I leave the exact signature of `cross_validate` to you but you should write it so that you can use it with *any* `classify` function of the same form (using higher order functions and partial application). If you did so last time, you can reuse it for this assignment.
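One way to read "higher order functions and partial application" is sketched below. Every name here (`majority_train`, `majority_classify`, `run_once`) is a hypothetical toy stand-in, not the required design: the point is only that `functools.partial` can fix extra arguments so the caller always sees the same function shape.

```python
from functools import partial
from typing import Callable, List

def majority_train(training_data: List[List[str]], default: str = "e") -> str:
    """Toy stand-in for train: the 'model' is just the most common label."""
    labels = [row[0] for row in training_data]
    return max(set(labels), key=labels.count) if labels else default

def majority_classify(model: str, observations: List[List[str]]) -> List[str]:
    """Toy stand-in for classify: predict the stored label for every row."""
    return [model for _ in observations]

def run_once(train_fn: Callable, classify_fn: Callable,
             training: List[List[str]], test: List[List[str]]) -> float:
    """Train, classify, and return the error rate with whichever functions are passed in."""
    model = train_fn(training)
    predictions = classify_fn(model, test)
    errors = sum(1 for row, pred in zip(test, predictions) if row[0] != pred)
    return errors / len(test)

# partial application fixes the extra argument, so run_once only ever
# sees a one-argument train function
train_fn = partial(majority_train, default="p")
training = [["e", "x"], ["e", "f"], ["p", "x"]]
test = [["e", "x"], ["p", "f"]]
print(run_once(train_fn, majority_classify, training, test))  # 0.5
```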

Following Module 3's discussion, `cross_validate` should print out the fold number and the evaluation metric (error rate) for each fold and then the average value (and the variance). What you are looking for here is a consistent evaluation metric across the folds. You should print the error rates in terms of percents (i.e., multiply the error rate by 100 and add "%" to the end).
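A sketch of the reporting described above, using made-up error rates. The helper name `summarize` is hypothetical, and the population variance (dividing by the number of folds) is an assumption; the sample variance would divide by one less.

```python
from typing import List, Tuple

def summarize(error_rates: List[float]) -> Tuple[float, float]:
    """Return the mean and (population) variance of the per-fold error rates."""
    n = len(error_rates)
    mean = sum(error_rates) / n
    variance = sum((e - mean) ** 2 for e in error_rates) / n
    return mean, variance

# made-up error rates for three folds, purely for illustration
rates = [0.02, 0.04, 0.03]
for fold, rate in enumerate(rates):
    print(f"Fold {fold}: {rate * 100:.2f}%")
mean, variance = summarize(rates)
print(f"Mean: {mean * 100:.2f}%  Variance: {variance:.6f}")
```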

To summarize...

Apply the Naive Bayes Classifier algorithm to the Mushroom data set using 10 fold cross validation and the error rate as the evaluation metric. You will do this *twice*. Once with smoothing=True and once with smoothing=False. You should follow up with a brief explanation for the similarities or differences in the results.

# Overview

The data is given in "agaricus-lepiota.data" as a text file - each line in the text file is a single observation, with a total of 8124 observations, each with 22 attributes.

| Index | Variable                  | Description |
| ----- | ------------------------- | ----------- |
| 0     | **class label**           | edible=e,poisonous=p |
| 1     | cap-shape                 | bell=b,conical=c,convex=x,flat=f,knobbed=k,sunken=s |
| 2     | cap-surface               | fibrous=f,grooves=g,scaly=y,smooth=s |
| 3     | cap-color                 | brown=n,buff=b,cinnamon=c,gray=g,green=r,pink=p,purple=u,red=e,white=w,yellow=y |
| 4     | bruises?                  | bruises=t,no=f |
| 5     | odor                      | almond=a,anise=l,creosote=c,fishy=y,foul=f,musty=m,none=n,pungent=p,spicy=s |
| 6     | gill-attachment           | attached=a,descending=d,free=f,notched=n |
| 7     | gill-spacing              | close=c,crowded=w,distant=d |
| 8     | gill-size                 | broad=b,narrow=n |
| 9     | gill-color                | black=k,brown=n,buff=b,chocolate=h,gray=g,green=r,orange=o,pink=p,purple=u,red=e,white=w,yellow=y |
| 10    | stalk-shape               | enlarging=e,tapering=t |
| 11    | stalk-root                | bulbous=b,club=c,cup=u,equal=e,rhizomorphs=z,rooted=r,missing=? |
| 12    | stalk-surface-above-ring  | fibrous=f,scaly=y,silky=k,smooth=s |
| 13    | stalk-surface-below-ring  | fibrous=f,scaly=y,silky=k,smooth=s |
| 14    | stalk-color-above-ring    | brown=n,buff=b,cinnamon=c,gray=g,orange=o,pink=p,red=e,white=w,yellow=y |
| 15    | stalk-color-below-ring    | brown=n,buff=b,cinnamon=c,gray=g,orange=o,pink=p,red=e,white=w,yellow=y |
| 16    | veil-type                 | partial=p,universal=u |
| 17    | veil-color                | brown=n,orange=o,white=w,yellow=y |
| 18    | ring-number               | none=n,one=o,two=t |
| 19    | ring-type                 | cobwebby=c,evanescent=e,flaring=f,large=l,none=n,pendant=p,sheathing=s,zone=z |
| 20    | spore-print-color         | black=k,brown=n,buff=b,chocolate=h,green=r,orange=o,purple=u,white=w,yellow=y |
| 21    | population                | abundant=a,clustered=c,numerous=n,scattered=s,several=v,solitary=y |
| 22    | habitat                   | grasses=g,leaves=l,meadows=m,paths=p,urban=u,waste=w,woods=d |

The attribute at index 11 (stalk-root) may have missing data, denoted by a "?". Any observation with missing data is excluded from the dataset. The class label is the first element in the list.

The following three functions are taken from Programming Assignment 3 to parse data, create folds in data, and separate the data into training and test sets. The `parse_data` function is augmented to generate lists of characters rather than floats, and excludes rows with "?" characters.

<a id="parse_data"></a>
## parse_data

In [2]:
def parse_data(file_name: str) -> List[List]:
    data = []
    # the context manager closes the file even if parsing fails
    with open(file_name, "r") as file:
        for line in file:
            datum = line.rstrip().split(",")
            # skip observations with missing values
            if "?" not in datum:
                data.append(datum)
    random.shuffle(data)
    return data

In [3]:
data = parse_data("agaricus-lepiota.data")
len(data) # 8124 observations, 2480 observations with missing data

5644

In [4]:
print(data[1])

['e', 'f', 's', 'g', 'f', 'n', 'f', 'w', 'b', 'k', 't', 'e', 'f', 's', 'w', 'w', 'p', 'w', 'o', 'e', 'k', 'a', 'g']


<a id="create_folds"></a>
## create_folds

In [5]:
def create_folds(xs: List, n: int) -> List[List[List]]:
    k, m = divmod(len(xs), n)
    # be careful of generators...
    return list(xs[i * k + min(i, m):(i + 1) * k + min(i + 1, m)] for i in range(n))

In [6]:
folds = create_folds(data, 10)

In [7]:
len(folds)

10
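The fold sizes produced by the `divmod`-based slicing in `create_folds` can be checked with a short sketch. The helper name `fold_sizes` is hypothetical; it reproduces only the size arithmetic, not the slicing itself.

```python
from typing import List

def fold_sizes(total: int, n: int) -> List[int]:
    """Sizes from the divmod-based split: the first `total % n` folds get one extra item."""
    k, m = divmod(total, n)
    return [k + 1 if i < m else k for i in range(n)]

sizes = fold_sizes(5644, 10)
print(sizes)  # [565, 565, 565, 565, 564, 564, 564, 564, 564, 564]
```

With 5644 observations the first four folds hold 565 rows and the remaining six hold 564, so using fold 0 as the test set leaves 5079 rows for training.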

<a id="create_train_test"></a>
## create_train_test

In [8]:
def create_train_test(folds: List[List[List]], index: int) -> Tuple[List[List], List[List]]:
    training = []
    test = []
    for i, fold in enumerate(folds):
        if i == index:
            test = fold
        else:
            training = training + fold
    return training, test

In [9]:
train, test = create_train_test(folds, 0)

In [10]:
len(train)

5079

In [11]:
len(test)

565

# Functions

## train

`train` takes data, a list of all class labels, and an optional smoothing parameter and computes the probability of every attribute value given a class label, $p(a_i | c_j)$. It counts the occurrences of each attribute value for a class label and divides by the occurrences of that class label, which yields a probability for each attribute/class-label pair. The Naive Bayes Classifier then uses these to classify data probabilistically. The smoothing parameter determines whether the probabilities are calculated with `+ 1` smoothing, which starts every count (for attribute/class-label pairs and for class labels) at 1 instead of 0. Smoothing keeps an attribute value that never occurs with a class label in the training data from receiving a probability of exactly 0, which would otherwise zero out the whole product for that class. The function returns a dictionary whose keys are either class labels ($c$) or attribute/class label tuples ($attr\_idx, attr\_val, c$). **Used by**: [cross_validate](#cross_validate).

* **training_data**: the data used to construct the probability dict from
* **class_labels**: a list of all possible class labels
* **smoothing**: optional parameter to implement `+ 1` smoothing

**returns** `Dict`: a dict of probabilities for each attribute/class label pair.
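As a worked example of the two settings, the sketch below computes one conditional probability by hand from the 15-row toy data set used in the unit tests. It assumes the `+ 1` is applied to both the pair count and the class count; `Fraction` is used only to keep the arithmetic exact.

```python
from fractions import Fraction

# counts taken from the 15-row toy data set used in the unit tests:
# 8 rows are labeled 'n', and 3 of those have 'rd' as the third attribute
count_rd_and_n = 3
count_n = 8

unsmoothed = Fraction(count_rd_and_n, count_n)        # 3/8
smoothed = Fraction(count_rd_and_n + 1, count_n + 1)  # 4/9
print(unsmoothed, smoothed)  # 3/8 4/9
```

The smoothed value corresponds to the key `(2, "rd", "n")` in the returned dict, since attribute indices start at 0 after the class label is removed.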

In [12]:
def train(training_data, class_labels, smoothing=True):
    # with +1 smoothing every count starts at 1 instead of 0
    start_val = 1 if smoothing else 0
    counts, nbc = {}, {}
    # count each class label, starting from start_val
    for row in training_data:
        counts[row[0]] = counts.get(row[0], start_val) + 1
    # initialize every (attribute index, value, label) combination seen in the data
    for row in training_data:
        for idx, attr in enumerate(row[1:]):
            for label in class_labels:
                counts.setdefault((idx, attr, label), start_val)
    # count each attribute value together with its row's label
    for row in training_data:
        for idx, attr in enumerate(row[1:]):
            counts[(idx, attr, row[0])] = counts.get((idx, attr, row[0]), start_val) + 1
    for k, v in counts.items():
        # tuple keys hold p(a_i | c); label keys hold the prior p(c)
        nbc[k] = v / counts[k[2]] if isinstance(k, tuple) else (v - start_val) / len(training_data)
    return nbc

In [18]:
# assertions/unit tests
data = [['n', 'ro', 'la', 'bl'], 
        ['y', 'sq', 'la', 'gr'],
        ['n', 'sq', 'sm', 'rd'],
        ['y', 'ro', 'la', 'rd'],
        ['n', 'sq', 'sm', 'bl'],
        ['n', 'ro', 'sm', 'bl'],
        ['y', 'ro', 'sm', 'rd'],
        ['n', 'sq', 'sm', 'gr'],
        ['y', 'ro', 'la', 'gr'],
        ['y', 'sq', 'la', 'gr'],
        ['n', 'sq', 'la', 'rd'],
        ['y', 'sq', 'la', 'gr'],
        ['y', 'ro', 'la', 'rd'],
        ['n', 'sq', 'sm', 'rd'],
        ['n', 'ro', 'sm', 'gr']]
nbc = train(data, ['y', 'n'], True)
# assert nbc["n"] == 8/15
# assert nbc[(2, "rd", "n")] == 4/9


# nbc = train(data, ['y', 'n'], False)
# assert nbc[(2, "rd", "n")] == (3/8)
nbc

{'n': 0.5333333333333333,
 'y': 0.4666666666666667,
 (0, 'ro', 'y'): 0.625,
 (0, 'ro', 'n'): 0.4444444444444444,
 (1, 'la', 'y'): 0.875,
 (1, 'la', 'n'): 0.3333333333333333,
 (2, 'bl', 'y'): 0.125,
 (2, 'bl', 'n'): 0.4444444444444444,
 (0, 'sq', 'y'): 0.5,
 (0, 'sq', 'n'): 0.6666666666666666,
 (2, 'gr', 'y'): 0.625,
 (2, 'gr', 'n'): 0.3333333333333333,
 (1, 'sm', 'y'): 0.25,
 (1, 'sm', 'n'): 0.7777777777777778,
 (2, 'rd', 'y'): 0.5,
 (2, 'rd', 'n'): 0.4444444444444444}

## classify_one

`classify_one` is a helper function that takes an observation, a dict of probabilities, and a list of class labels and classifies the observation. This function closely follows the pseudocode discussed in class. The function computes the probability of a given class label using the numerator of Bayes Rule:
$$ P(C | A) \propto P(A | C) * P(C) $$

We can compute this for every class label $c$ as follows:

$$ p(c | A) \propto p(c) * \prod_{i=1}^{n} p(a_i | c) $$

The function creates a dictionary that maps each class label to its probability and returns a tuple: `(majority_class, probs)` where `probs` is that dictionary. The probabilities are normalized by dividing each one by the sum of the probabilities over all class labels. **Used by**: [classify](#classify).

* **observation**: a list of attributes representing one observation
* **nbc**: a dict of probabilities for each attribute returned by `train`
* **class_labels**: a list of all possible class labels

**returns** `Tuple`: a tuple of the majority class label and a dict of all class labels mapped to their normalized probabilities.
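With 22 attributes the product of probabilities can become very small. A common alternative, sketched here under stated assumptions, is to add base-2 logarithms instead of multiplying and to convert back only when normalizing. This is not what `classify_one` below does; the names `log_scores` and `normalize_log` are hypothetical, and the inputs are the smoothed toy-data values for the observation `["sq", "la", "rd"]`.

```python
from math import log2
from typing import Dict, List

def log_scores(prior: Dict[str, float], likelihoods: Dict[str, List[float]]) -> Dict[str, float]:
    """Sum log2 probabilities instead of multiplying raw probabilities."""
    scores = {}
    for label, p_c in prior.items():
        factors = [p_c] + likelihoods[label]
        # a zero factor makes the whole product zero, i.e. a log score of -infinity
        scores[label] = float("-inf") if 0 in factors else sum(log2(f) for f in factors)
    return scores

def normalize_log(scores: Dict[str, float]) -> Dict[str, float]:
    """Convert log2 scores back to normalized probabilities."""
    top = max(scores.values())
    if top == float("-inf"):
        return {label: 0.0 for label in scores}
    # subtracting the largest score before exponentiating keeps values in range
    weights = {label: 2 ** (s - top) for label, s in scores.items()}
    total = sum(weights.values())
    return {label: w / total for label, w in weights.items()}

# smoothed toy-data values: P(c), then P(sq|c), P(la|c), P(rd|c)
prior = {"y": 7 / 15, "n": 8 / 15}
likelihoods = {"y": [4 / 8, 7 / 8, 4 / 8], "n": [6 / 9, 3 / 9, 4 / 9]}
print(normalize_log(log_scores(prior, likelihoods)))
```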

In [14]:
def classify_one(observation, nbc, class_labels):
    probs = {}
    for label in class_labels:
        # start from the prior p(c), then multiply in each p(a_i | c)
        probs[label] = nbc[label]
        for idx, attr in enumerate(observation):
            # an attribute value never seen in training contributes probability 0
            probs[label] *= nbc.get((idx, attr, label), 0)
    sum_probs = sum(probs.values())
    if sum_probs > 0:  # avoid dividing by zero when every class scores 0
        probs = {k: v / sum_probs for k, v in probs.items()}
    majority_class = max(probs, key=probs.get)
    probs = dict(sorted(probs.items(), key=lambda prob: prob[1], reverse=True))
    return (majority_class, probs)

In [15]:
# assertions/unit tests
data = [['n', 'ro', 'la', 'bl'], 
        ['y', 'sq', 'la', 'gr'],
        ['n', 'sq', 'sm', 'rd'],
        ['y', 'ro', 'la', 'rd'],
        ['n', 'sq', 'sm', 'bl'],
        ['n', 'ro', 'sm', 'bl'],
        ['y', 'ro', 'sm', 'rd'],
        ['n', 'sq', 'sm', 'gr'],
        ['y', 'ro', 'la', 'gr'],
        ['y', 'sq', 'la', 'gr'],
        ['n', 'sq', 'la', 'rd'],
        ['y', 'sq', 'la', 'gr'],
        ['y', 'ro', 'la', 'rd'],
        ['n', 'sq', 'sm', 'rd'],
        ['n', 'ro', 'sm', 'gr']]
class_labels = ['y', 'n']
nbc = train(data, class_labels, True)
classification = classify_one(["sq", "la", "rd"], nbc, class_labels)
assert abs(classification[1]['y'] - (0.102 / (0.102 + 0.053))) < 0.005
assert abs(classification[1]['n'] - (0.053 / (0.102 + 0.053))) < 0.005
assert classification[0] == 'y'

## classify

`classify` takes a list of observations and classifies each using the Naive Bayes Classifier algorithm implemented in `classify_one`. The function takes an optional parameter `labeled` that denotes whether the data is labeled or unlabeled. If it is unlabeled, the function prepends a placeholder column to each row in place of the unknown class label, so that both cases can be handled the same way. The function returns a list of classifications for each observation in order. **Uses**: [classify_one](#classify_one). **Used by**: [cross_validate](#cross_validate).

* **nbc**: a dict of probabilities for each attribute returned from `train`.
* **observations**: a list of observations as lists of attributes
* **class_labels**: a list of all possible class labels
* **labeled**: an optional parameter denoting whether `observations` is labeled

**returns** `List`: a list of classifications as tuples returned from `classify_one`.

In [16]:
def classify(nbc, observations, class_labels, labeled=True):
    dataset = deepcopy(observations) if labeled else [[None] + deepcopy(row) for row in observations]
    classifications = []
    for row in dataset:
        classification = classify_one(row[1:], nbc, class_labels)
        classifications.append(classification)
    return classifications

In [17]:
data = [['n', 'ro', 'la', 'bl'], 
        ['y', 'sq', 'la', 'gr'],
        ['n', 'sq', 'sm', 'rd'],
        ['y', 'ro', 'la', 'rd'],
        ['n', 'sq', 'sm', 'bl'],
        ['n', 'ro', 'sm', 'bl'],
        ['y', 'ro', 'sm', 'rd'],
        ['n', 'sq', 'sm', 'gr'],
        ['y', 'ro', 'la', 'gr'],
        ['y', 'sq', 'la', 'gr'],
        ['n', 'sq', 'la', 'rd'],
        ['y', 'sq', 'la', 'gr'],
        ['y', 'ro', 'la', 'rd'],
        ['n', 'sq', 'sm', 'rd'],
        ['n', 'ro', 'sm', 'gr']]
class_labels = ['y', 'n']
nbc = train(data, class_labels, True)
obs = [["sq", "la", "rd"]]
classifications = classify(nbc, obs, class_labels, False)
assert classifications[0][0] == "y"

obs = [['n', 'ro', 'la', 'bl'],
       ['n', 'sq', 'sm', 'bl'],
       ['y', 'ro', 'la', 'gr']]

classifications = classify(nbc, obs, class_labels, True)
labels = [row[0][0] for row in classifications]
assert labels == ['n', 'n', 'y']
assert abs(classifications[1][1]['n'] + classifications[1][1]['y'] - 1) < 1e-9

## evaluate

`evaluate` takes a list of classifications and labeled data and returns the error rate of the classifications. Error rate is: 
$$error\_rate=\frac{errors}{n}$$
**Used by**: [cross_validate](#cross_validate).

* **labeled_data**: the real values for labels to compare to
* **classifications**: the estimates to determine error rate for

**returns** `float`: the error rate as a float.

In [18]:
def evaluate(labeled_data, classifications):
    num_errors = 0
    labels = [row[0] for row in labeled_data]
    for actual_label, classified_label in zip(labels, classifications):
        if actual_label != classified_label[0]: num_errors += 1
    return num_errors / len(classifications)

In [19]:
# assertions/unit tests
data = [['n', 'ro', 'la', 'bl'], 
        ['y', 'sq', 'la', 'gr'],
        ['n', 'sq', 'sm', 'rd'],
        ['y', 'ro', 'la', 'rd'],
        ['n', 'sq', 'sm', 'bl'],
        ['n', 'ro', 'sm', 'bl'],
        ['y', 'ro', 'sm', 'rd'],
        ['n', 'sq', 'sm', 'gr'],
        ['y', 'ro', 'la', 'gr'],
        ['y', 'sq', 'la', 'gr'],
        ['n', 'sq', 'la', 'rd'],
        ['y', 'sq', 'la', 'gr'],
        ['y', 'ro', 'la', 'rd'],
        ['n', 'sq', 'sm', 'rd'],
        ['n', 'ro', 'sm', 'gr']]
class_labels = ['y', 'n']
nbc = train(data, class_labels, True)
classifications = classify(nbc, data, class_labels, True)
assert evaluate(data, classifications) == 2/15

## cross_validate


`cross_validate` takes the data and uses 10 fold cross validation to `train`, `classify`, and `evaluate`. The function shuffles the data, splits the data into folds, and performs 10-fold cross validation on the folds. The error rate for each fold's evaluation is printed, and the average error rate is printed at the end. **Uses**: [create_folds](#create_folds), [create_train_test](#create_train_test), [train](#train), [classify](#classify), [evaluate](#evaluate).

* **data**: the dataset
* **smoothing**: a flag denoting whether to use `+ 1` smoothing in training
* **class_labels**: a list of all possible class labels
* **classify**: a function that takes a dict of probabilities, a set of observations, and a list of class labels and returns the classifications for each observation

In [20]:
def cross_validate(data, smoothing, class_labels, classify):
    avg_err_rate = 0
    random.shuffle(data)
    folds = create_folds(data, 10)
    print("Smoothing:", smoothing)
    for i in range(10):
        train_data, test_data = create_train_test(folds, i)
        nbc = train(train_data, class_labels, smoothing)
        classifications = classify(nbc, test_data, class_labels)
        error_rate = round(evaluate(test_data, classifications), 5)
        print("Fold", i, "error rate:", error_rate)
        avg_err_rate += error_rate
    avg_err_rate = round(avg_err_rate / 10, 5)
    print("Avg. error rate:", avg_err_rate)
    print()

In [21]:
data = parse_data("agaricus-lepiota.data")
cross_validate(data, True, ['e', 'p'], classify)
cross_validate(data, False, ['e', 'p'], classify)

Smoothing: True
Fold 0 error rate: 0.02478
Fold 1 error rate: 0.02655
Fold 2 error rate: 0.03894
Fold 3 error rate: 0.02655
Fold 4 error rate: 0.0266
Fold 5 error rate: 0.0195
Fold 6 error rate: 0.01064
Fold 7 error rate: 0.03546
Fold 8 error rate: 0.03369
Fold 9 error rate: 0.01418
Avg. error rate: 0.02569

Smoothing: False
Fold 0 error rate: 0.00177
Fold 1 error rate: 0.00177
Fold 2 error rate: 0.00531
Fold 3 error rate: 0.00177
Fold 4 error rate: 0.00355
Fold 5 error rate: 0.00177
Fold 6 error rate: 0.00355
Fold 7 error rate: 0.00177
Fold 8 error rate: 0.00355
Fold 9 error rate: 0.00355
Avg. error rate: 0.00284



# Results

Here, we see an average error rate of about 0.026 with smoothing and about 0.003 without it, roughly a ninefold difference. The smoothing is most likely hurting the classification by giving weight to attribute/class-label combinations that never actually occur. With 5644 observations, there are plenty of occurrences behind each observed pair, so the unsmoothed counts are already reliable. Smoothing mainly helps when some pairs have few or no occurrences, which is not the case here. Since the data comes from a field manual, it stands to reason that smoothing adds little: there is little statistical noise to smooth, and smoothing only introduces more error into our estimates.

In general, Naive Bayes performs well, considering it uses only per-attribute conditional probabilities (under the assumption that attributes are independent given the class), and does not really "learn" in the same way a decision tree does.
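One effect of the `+ 1` scheme that can be checked directly, and that may be one contributing factor rather than the whole explanation, is that the smoothed estimates for the values of a single attribute no longer sum to 1. The sketch below uses counts from the 15-row toy data set in the unit tests and assumes the `+ 1` is applied to both the pair count and the class count.

```python
from fractions import Fraction

# toy data: 7 rows are labeled 'y'; the first attribute is 'ro' in 4 of them and 'sq' in 3
counts_given_y = {"ro": 4, "sq": 3}
n_y = 7

unsmoothed_total = sum(Fraction(c, n_y) for c in counts_given_y.values())
smoothed_total = sum(Fraction(c + 1, n_y + 1) for c in counts_given_y.values())
print(unsmoothed_total, smoothed_total)  # 1 9/8
```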

## Before You Submit...

1. Did you provide output exactly as requested?
2. Did you re-execute the entire notebook? ("Restart Kernel and Run All Cells...")
3. If you did not complete the assignment or had difficulty please explain what gave you the most difficulty in the Markdown cell below.
4. Did you change the name of the file to `jhed_id.ipynb`?

Do not submit any other files.