The following dependencies are required to run the code:
    pandas
    numpy

NOTE: matplotlib is an optional dependency, used only to generate charts.

First we import the dependencies, load the mushroom data, shuffle it, and split it into test and training records:

In [18]:
import pandas as pd
import numpy as np

mushroom_data = pd.read_csv('../Data/agaricus-lepiota.data')
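# shuffle the rows, then renumber the index to match the new order
mushroom_data = mushroom_data.iloc[np.random.permutation(len(mushroom_data))]
# reset_index returns a new DataFrame, so the result must be assigned back
mushroom_data = mushroom_data.reset_index(drop=True)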

num_test = 100
training_records = mushroom_data[num_test:].values
test_records = mushroom_data[:num_test].values
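
The shuffle above draws from NumPy's global random state, so the split, and therefore the accuracies reported below, can change from run to run. A minimal sketch of a reproducible variant follows. The helper name `shuffle_and_split`, the `seed` parameter, and the toy DataFrame are assumptions made for illustration and are not part of the original notebook.

```python
import pandas as pd

def shuffle_and_split(df, num_test, seed=0):
    """Shuffle rows reproducibly, then hold out the first num_test rows for testing."""
    # sample(frac=1) returns every row in random order; random_state fixes that order
    shuffled = df.sample(frac=1, random_state=seed).reset_index(drop=True)
    test = shuffled.iloc[:num_test]
    train = shuffled.iloc[num_test:]
    return train, test

# Toy stand-in for the mushroom data (values are illustrative only)
toy = pd.DataFrame({'class': list('epepepepep'), 'odor': list('nfnfnfnfnf')})
train, test = shuffle_and_split(toy, num_test=3, seed=42)
print('%i training rows, %i test rows' % (len(train), len(test)))
```

Calling the helper again with the same seed returns the same split, which makes results comparable across runs.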

# K Nearest Neighbor

In [19]:
from KNN.helpers import get_closest_k,vote_by_neighbor_weights
from KNN.knnreadable import KNNReadable

k = 5
# map each categorical value to an integer code so that a distance-based classifier can use it
def get_value_map(unique_values):
    return {unique_values[i]: i for i in range(len(unique_values))}
    
distinct_col_vals = {col: get_value_map(mushroom_data[col].unique()) for col in mushroom_data.columns}

def get_knn_readable(v):
    classification = v[0]
    values = []
    for i in range(1,len(mushroom_data.columns)):
        col = mushroom_data.columns[i]
        col_val = v[i]
        numerical_val = distinct_col_vals[col][col_val]
        values.append(numerical_val)
    return KNNReadable(values,classification)
    
training_data = [get_knn_readable(v) for v in training_records]

test_data = [get_knn_readable(v) for v in test_records]

# k == 0 means "use every training record as a neighbor"
k = len(training_data) if k == 0 else k
for test_obj in test_data:
    closest_k = get_closest_k(test_obj,training_data,k)
    test_obj.guess = vote_by_neighbor_weights(test_obj,closest_k)

# count the test records whose predicted class matches the stored one
num_correct = len([td for td in test_data if td.name == td.guess])
print 'Results: %i correct out of %i' %(num_correct,num_test)

Results: 100 correct out of 100
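
`get_closest_k` and `vote_by_neighbor_weights` come from the repository's `KNN.helpers` module and are not shown here. As a rough illustration of what helpers of this kind typically do, the sketch below ranks training vectors by Euclidean distance and lets each of the k nearest neighbors vote with weight 1/(distance + eps). The function names `closest_k` and `weighted_vote`, the weighting scheme, and the toy data are assumptions, not the repository's actual implementation.

```python
import math
from collections import defaultdict

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def closest_k(query, training, k):
    """Return the k nearest (distance, label) pairs; training is a list of (vector, label)."""
    scored = [(euclidean(query, vec), label) for vec, label in training]
    return sorted(scored, key=lambda pair: pair[0])[:k]

def weighted_vote(neighbors, eps=1e-9):
    """Pick the label with the largest total inverse-distance weight."""
    votes = defaultdict(float)
    for dist, label in neighbors:
        votes[label] += 1.0 / (dist + eps)  # eps avoids division by zero on exact matches
    return max(votes, key=votes.get)

# Toy integer-encoded records (illustrative only)
training = [([0, 0, 1], 'e'), ([0, 1, 1], 'e'), ([3, 2, 0], 'p'), ([3, 3, 0], 'p')]
neighbors = closest_k([0, 0, 0], training, k=3)
print(weighted_vote(neighbors))
```

One design caveat: integer codes impose an arbitrary ordering on categories, so Euclidean distance treats some value pairs as "closer" than others for no real reason. Counting mismatched attributes (Hamming distance) is a common alternative for purely categorical data.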


# Decision Tree Classification

In [20]:
from DTree.helpers import *

print 'Decision Tree Classification'
print 'Computing...'

classification_attributes = [col for col in mushroom_data.columns if col != 'class']
# same split as before: the first num_test shuffled rows are held out for testing
train_data = mushroom_data.iloc[num_test:]
test_data = mushroom_data.iloc[:num_test]
# every value each attribute can take, collected over the full dataset
attr_vals = {attr: list(set(mushroom_data[attr])) for attr in classification_attributes}

# build the decision tree used for classification
decision_tree = id3(train_data,classification_attributes,attr_vals)

# compare the tree's classification with the actual class of each test record
# and keep track of the number correct
test_records = [record for _, record in test_data.iterrows()]
total_test_records = len(test_data)
num_correct = 0
for record in test_records:
    guess = classify_test_case(decision_tree,record)
    if guess == record['class']:
        num_correct += 1

print 'Results: %i correct out of %i' %(num_correct,total_test_records)

Decision Tree Classification
Computing...
Results: 100 correct out of 100
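
`id3` comes from the repository's `DTree.helpers` module and is not shown here. In general, ID3 chooses at each node the attribute with the highest information gain, meaning the largest drop in class entropy after splitting on that attribute. The sketch below computes that quantity with pandas. The function names `entropy` and `information_gain` and the toy DataFrame are hypothetical and may not match the repository's helpers.

```python
import math
import pandas as pd

def entropy(labels):
    """Shannon entropy (in bits) of a pandas Series of class labels."""
    probs = labels.value_counts(normalize=True)
    return -sum(p * math.log(p, 2) for p in probs if p > 0)

def information_gain(df, attr, target='class'):
    """Entropy of target minus the size-weighted entropy after splitting on attr."""
    total = entropy(df[target])
    remainder = 0.0
    for _, subset in df.groupby(attr):
        weight = len(subset) / float(len(df))
        remainder += weight * entropy(subset[target])
    return total - remainder

# Toy data (illustrative only)
toy = pd.DataFrame({
    'class': ['e', 'e', 'p', 'p'],
    'odor':  ['n', 'a', 'f', 'f'],   # separates the two classes perfectly
    'veil':  ['p', 'p', 'p', 'p'],   # a single value, so it carries no information
})
print('odor gain: %.3f' % information_gain(toy, 'odor'))
print('veil gain: %.3f' % information_gain(toy, 'veil'))
```

On the toy frame, splitting on `odor` removes all of the class entropy, while `veil` removes none, so an ID3-style learner would split on `odor` first.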


The following code runs the decision tree classifier with one classification attribute at a time to see how well each field, on its own, predicts edibility.

In [27]:
l_num_correct = []  # number correct per attribute, in column order
for attr in classification_attributes:
    num_correct = 0
    index = 0
    d_tree = id3(train_data,[attr],attr_vals)
    for record in test_records:
        guess = classify_test_case(d_tree,record)
        if guess == record['class']:
            num_correct += 1
        index += 1
    l_num_correct.append(num_correct)
    print '%s: %i correct out of %i' %(attr,num_correct,index)

cap-shape: 56 correct out of 100
cap-surface: 49 correct out of 100
cap-color: 59 correct out of 100
bruises: 75 correct out of 100
odor: 100 correct out of 100
gill-attachment: 54 correct out of 100
gill-spacing: 58 correct out of 100
gill-size: 73 correct out of 100
gill-color: 78 correct out of 100
stalk-shape: 55 correct out of 100
stalk-root: 60 correct out of 100
stalk-surface-above-ring: 76 correct out of 100
stalk-surface-below-ring: 79 correct out of 100
stalk-color-above-ring: 70 correct out of 100
stalk-color-below-ring: 70 correct out of 100
veil-type: 54 correct out of 100
veil-color: 54 correct out of 100
ring-number: 54 correct out of 100
ring-type: 79 correct out of 100
spore-print-color: 85 correct out of 100
population: 71 correct out of 100
habitat: 76 correct out of 100


Classification based on odor alone reached 100% accuracy in some runs (usually between 97% and 100%). Spore print color alone also consistently exceeded 80% accuracy. In most runs, at least 10 fields each exceeded 70% accuracy on their own.

Having so many fields that separate the classes this well is what allowed both classifiers to achieve excellent results on this dataset.
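
To check the count of fields above 70% against the single run printed earlier, the scores can be tallied directly. The dictionary below is transcribed from that output. The variable names are illustrative, and a different shuffle would give somewhat different numbers.

```python
# Number correct out of 100, copied from the single-attribute run shown earlier
single_attr_correct = {
    'cap-shape': 56, 'cap-surface': 49, 'cap-color': 59, 'bruises': 75,
    'odor': 100, 'gill-attachment': 54, 'gill-spacing': 58, 'gill-size': 73,
    'gill-color': 78, 'stalk-shape': 55, 'stalk-root': 60,
    'stalk-surface-above-ring': 76, 'stalk-surface-below-ring': 79,
    'stalk-color-above-ring': 70, 'stalk-color-below-ring': 70,
    'veil-type': 54, 'veil-color': 54, 'ring-number': 54, 'ring-type': 79,
    'spore-print-color': 85, 'population': 71, 'habitat': 76,
}

# Rank attributes from most to least accurate
ranked = sorted(single_attr_correct.items(), key=lambda kv: kv[1], reverse=True)
# Keep the attributes that scored strictly above 70
above_70 = [attr for attr, n in ranked if n > 70]

print('%i attributes scored above 70:' % len(above_70))
for attr in above_70:
    print('  %s: %i' % (attr, single_attr_correct[attr]))
```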

The following code investigates some of the more reliable properties for determining edibility of mushrooms.

In [14]:
class_odors = {}
for classification in mushroom_data['class'].unique():
    class_data = mushroom_data[mushroom_data['class'] == classification]
    # number of records with each odor within this class
    class_odors[classification] = class_data['odor'].value_counts()
print class_odors

{'p': f    2160
y     576
s     576
p     256
c     192
n     120
m      36
Name: odor, dtype: int64, 'e': n    3408
a     400
l     400
Name: odor, dtype: int64}


The odor counts for each class, with the single-letter codes expanded to their names:

    Poisonous:
        2160 foul
        576 fishy
        576 spicy
        256 pungent
        192 creosote
        120 none
        36 musty

    Edible:
        3408 none
        400 almond
        400 anise
            
The only ambiguous case in the data is the absence of odor, since "none" is the only odor value shared by poisonous and edible mushrooms. Even then, an overwhelming majority of odorless mushrooms (3408 of 3528) are edible.
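
The arithmetic behind this can be worked out from the counts listed above. The sketch below computes the edible share of odorless mushrooms, and the accuracy of a rule that always predicts the majority class for each odor. It is evaluated on the full dataset rather than on a held-out test set, so it is only a rough companion to the test accuracies reported earlier; the variable names are illustrative.

```python
# Odor counts per class, transcribed from the listing above
poisonous = {'foul': 2160, 'fishy': 576, 'spicy': 576, 'pungent': 256,
             'creosote': 192, 'none': 120, 'musty': 36}
edible = {'none': 3408, 'almond': 400, 'anise': 400}

# Share of odorless mushrooms that are edible
odorless_total = poisonous['none'] + edible['none']
edible_share = edible['none'] / float(odorless_total)

# A majority rule is wrong only on the minority class of each shared odor
total = sum(poisonous.values()) + sum(edible.values())
all_odors = set(poisonous) | set(edible)
errors = sum(min(poisonous.get(o, 0), edible.get(o, 0)) for o in all_odors)
accuracy = (total - errors) / float(total)

print('%.1f%% of odorless mushrooms are edible' % (100 * edible_share))
print('odor-only majority rule: %i errors in %i records (%.1f%%)'
      % (errors, total, 100 * accuracy))
```

The only errors come from the 120 odorless poisonous mushrooms, which is consistent with odor alone scoring close to, but not always exactly, 100% on a random test set.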

We can inspect other fields in the same way to get a better idea of which ones are strong identifiers of mushroom edibility. For example, we can look at spore print color:

    Poisonous:
        1812 white
        1584 chocolate
        224 brown
        224 black
        72 green
    Edible:
        1744 brown
        1648 black
        576 white
        48 purple
        48 buff
        48 orange
        48 chocolate
        
Here we see that green is infrequent but appears only in poisonous mushrooms, while purple, buff, and orange are infrequent but appear only in edible ones. For the colors shared by both classes, a large majority of the records belongs to one class.
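
The per-class counting used for odor generalizes to any column. A minimal sketch using `pandas.crosstab` is shown below, assuming a DataFrame with a `class` column as in this notebook. The helper name `class_share_by_value` and the toy rows are made up for illustration and do not reproduce the real counts.

```python
import pandas as pd

def class_share_by_value(df, column, target='class'):
    """Rows: values of column. Columns: classes. Cells: share of that value's records per class."""
    return pd.crosstab(df[column], df[target], normalize='index')

# Toy rows (illustrative only, not the real data)
toy = pd.DataFrame({
    'class': ['p', 'p', 'p', 'e', 'e', 'e', 'e', 'p'],
    'spore-print-color': ['w', 'w', 'h', 'n', 'n', 'k', 'w', 'r'],
})
shares = class_share_by_value(toy, 'spore-print-color')

# Values whose records all fall into a single class
exclusive = shares[(shares == 1.0).any(axis=1)].index.tolist()

print(shares)
print('values seen in only one class: %s' % exclusive)
```

Applied to the real data, for example as `class_share_by_value(mushroom_data, 'spore-print-color')`, rows with a share of 1.0 would mark values exclusive to one class, and rows near 0.5 would mark values that say little about edibility.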
    