# Dave's Attempt at Mushroom Classification

## AWS Machine Learning Specialty Course 7.92

\[Pasted -v-\]

## Exercise - Mushroom Classification

In this exercise, you need to classify mushrooms as edible or poisonous.

This data set is provided by UCI: https://archive.ics.uci.edu/ml/datasets/mushroom.  You can read the problem description and objective in the UCI website.

Build a classifier using XGBoost. You also need to perform data cleanup and transformation before you can train on XGBoost.

Complete Solution is available here (however, try to solve on your own):

https://github.com/ChandraLingam/AmazonSageMakerCourse/tree/master/xgboost/MushroomClassification

## Data Prep and Training in the same Notebook

### Follow iris (and, for Q&R, the solution)

### Mushroom Classification Dataset

&nbsp;&nbsp;Input features, ready for a Python list:<br/>
'cap-shape', 'cap-surface', 'cap-color', 'bruises', 'odor', 'gill-attachment', 'gill-spacing', 'gill-size', 'gill-color', 'stalk-shape', 'stalk-root', 'stalk-surface-above-ring', 'stalk-surface-below-ring', 'stalk-color-above-ring', 'stalk-color-below-ring', 'veil-type', 'veil-color', 'ring-number', 'ring-type', 'spore-print-color', 'population', 'habitat'

<br/>

&nbsp;&nbsp;Target:<br/>
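Is the mushroom edible?  In the raw data, 'e' is edible and 'p' is poisonous. Heads up: after label encoding, 0 is edible and 1 is poisonous, so a 1 in this column means poisonous, despite the column name.<br/>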
'mushroom_is_edible'

In [None]:
!pip install xgboost

In [None]:
!pip install requests

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import os
import sys
import pathlib
import shutil
import itertools

import requests       # Might need  `pip install requests`
import zipfile as zf

import xgboost as xgb
from sklearn import preprocessing
from sklearn.metrics import classification_report, confusion_matrix

In [None]:
# For getting the data
data_url = "https://archive.ics.uci.edu/static/public/73/mushroom.zip"
data_zip_filename = "mushroom.zip"
data_filename = "mushroom_all.csv"

# For zipfile
working_dir = os.getcwd()
unzipped_dir = "mushroom_unzipped"
new_dirname = os.path.join(working_dir, unzipped_dir)
pathlib.Path(new_dirname).mkdir(parents=True, exist_ok=True)

Checked: The new directory is there. Hooray!

In [None]:
print(f"  new_dirname:\n{new_dirname}")

In [None]:
#  Get the zip. The path (and the fact that it's a zip) came
#+ from looking at https://archive.ics.uci.edu/ml/datasets/mushroom
mushroom_request = requests.get(data_url, 
                                allow_redirects=True)
mushroom_request.raise_for_status()  # stop here if the download failed
with open(data_zip_filename, 'wb') as fh:
    fh.write(mushroom_request.content)
##endof:  with open ... fh

Checked: the new zip archive is there. Hooray again!

Not sure about this next part, but I'm following Chandra's stuff as well as I know how.

In [None]:
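# Use the filename variable from above; the with-block closes the archive.
with zf.ZipFile(data_zip_filename) as zipfile_thing:
    zipfile_thing.extractall(new_dirname)
##endof:  with zf.ZipFile ... zipfile_thing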

Checked: the contents of the zip are there. <strike>I've learned that `zipfile_thing.extractall(".")` isn't the way to go</strike> - that or I did something wrong on my local machine. Yeah, oops, I just tried it here on AWS, and it worked fine. It's nice to have it in another directory, though, especially since I want to change the name.
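
For what it's worth, here's a minimal sketch of the extract-into-its-own-directory pattern. It builds a throwaway zip in memory so it runs without the download; the archive contents and the `demo_dir` name are made up for illustration, and only the standard library is used.

```python
import io
import os
import tempfile
import zipfile as zf

# Build a tiny zip in memory (a stand-in for mushroom.zip).
buffer = io.BytesIO()
with zf.ZipFile(buffer, "w") as archive:
    archive.writestr("agaricus-lepiota.data", "p,x,s,n\ne,x,s,y\n")
    archive.writestr("agaricus-lepiota.names", "1. Title: Mushroom Database\n")

# Extract into its own directory; the with-block closes the archive.
with tempfile.TemporaryDirectory() as demo_dir:
    with zf.ZipFile(buffer) as archive:
        member_names = archive.namelist()
        archive.extractall(demo_dir)
    extracted = sorted(os.listdir(demo_dir))

print(member_names)
print(extracted)
```

Listing the members with `namelist()` before extracting is a cheap way to see what's in the archive without poking around the filesystem.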

In [None]:
# bash is underneath this Notebook
!ls -lah "/home/ec2-user/SageMaker/AmazonSageMakerCourse/xgboost/dwb_Mushroom_Try_2023-07-29/mushroom_unzipped"

In [None]:
!stat "/home/ec2-user/SageMaker/AmazonSageMakerCourse/xgboost/dwb_Mushroom_Try_2023-07-29/mushroom_unzipped/expanded.Z"
print()
print("I've done some more inspection on  expanded.Z ;")
print("it's an archive, but not a zip.")
print("Inside is a text file with longer names for characteristics.")
print("I'll skip it for now.")

In [None]:
# Check for the file we want.
candidate_fname_1 = os.path.join(new_dirname, 
                                 "agaricus-lepiota.data")
candidate_fname_2 = os.path.join(new_dirname, 
                                 "agaricus-lepiota.names")


def head_filename(this_fname, n_lines_to_read=10):
    print("-" * 50)
    print(f"  First {n_lines_to_read} lines from" + 
          f"\n{this_fname}")
    print("-" * 5)
    with open(this_fname, 'r', encoding="utf-8") as fh1:
        for _ in range(n_lines_to_read):
            # Each line keeps its own newline, so don't add another
            print(fh1.readline(), end="")
        ##endof:  for _ in range(n_lines_to_read)
    ##endof:  with open ... fh1
    print("-" * 50)
    print()
##endof:  head_filename(<params>)

print()
head_filename(candidate_fname_1)
print()
head_filename(candidate_fname_2)

We want `agaricus-lepiota.data`. That's not a huge surprise.

In [None]:
shutil.copy(candidate_fname_1, data_filename)

In [None]:
# checking; might as well use that head_filename function
head_filename(data_filename)

In [None]:
columns = ['mushroom_is_edible', 'cap-shape', 'cap-surface', 'cap-color', 
           'bruises', 'odor', 'gill-attachment', 'gill-spacing', 'gill-size', 
           'gill-color', 'stalk-shape', 'stalk-root', 
           'stalk-surface-above-ring', 'stalk-surface-below-ring', 
           'stalk-color-above-ring', 'stalk-color-below-ring', 'veil-type', 
           'veil-color', 'ring-number', 'ring-type', 'spore-print-color', 
           'population', 'habitat'
          ]

In [None]:
df = pd.read_csv(data_filename, names=columns)

In [None]:
print(df.head())

In [None]:
# check before encoding
df['mushroom_is_edible'].value_counts()

In [None]:
print(df['mushroom_is_edible'].value_counts())

Now, we'll need to encode the letters as numbers.
```
ref_enc = ("https://stackoverflow.com/questions/24458645/"
           "label-encoding-across-multiple-columns-in-scikit-learn")
```

That reference was found in Chandra's stuff.
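
Here's a small sketch of that one-encoder-per-column pattern on made-up data, mostly to show which letter gets which number and that the encoders can undo it. The `toy_df` rows and the variable names are hypothetical; the calls are plain pandas and scikit-learn.

```python
from collections import defaultdict

import pandas as pd
from sklearn import preprocessing

# Made-up rows in the same single-letter style as the mushroom data.
toy_df = pd.DataFrame({
    "mushroom_is_edible": ["p", "e", "e", "p"],
    "odor": ["f", "a", "n", "f"],
})

# One LabelEncoder per column, created on first access by column name.
encoders = defaultdict(preprocessing.LabelEncoder)
toy_encoded = toy_df.apply(lambda col: encoders[col.name].fit_transform(col))

# LabelEncoder sorts its classes, so 'e' becomes 0 and 'p' becomes 1.
target_classes = list(encoders["mushroom_is_edible"].classes_)
target_codes = toy_encoded["mushroom_is_edible"].tolist()

# The stored encoders can map the numbers back to letters.
toy_decoded = toy_encoded.apply(
    lambda col: encoders[col.name].inverse_transform(col)
)
decoded_odor = toy_decoded["odor"].tolist()

print(target_classes, target_codes)
print(decoded_odor)
```

Because the classes are sorted, a 1 in the target column means poisonous, which matters later when picking the "positive" class for the metrics.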

In [None]:
from collections import defaultdict

d = defaultdict(preprocessing.LabelEncoder)
df = df.apply(lambda x: d[x.name].fit_transform(x))

In [None]:
print(df.head())

In [None]:
# check after encoding
df['mushroom_is_edible'].value_counts()

In [None]:
print(df['mushroom_is_edible'].value_counts())

In [None]:
# Nifty way to look at the data from Chandra
for key in d.keys():
    print(key, d[key].classes_)

And we see the nice question mark (`'?'`) in the stalk-root column that we read about from the dataset site. Those are missing values.
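
A quick sketch of how one might count those markers, on made-up rows rather than the real file (the `toy_raw` frame is hypothetical). The notebook itself just leaves `'?'` as its own category, which is what the label encoding does; the NaN conversion is shown only as an option.

```python
import numpy as np
import pandas as pd

# Made-up rows; '?' is the dataset's marker for a missing stalk-root.
toy_raw = pd.DataFrame({
    "stalk-root": ["e", "?", "b", "?", "c"],
    "odor": ["a", "n", "f", "n", "a"],
})

# Count the '?' markers per column.
missing_counts = (toy_raw == "?").sum().to_dict()

# Optional: turn the marker into a real missing value instead.
toy_nan = toy_raw.replace("?", np.nan)
nan_counts = toy_nan.isna().sum().to_dict()

print(missing_counts)
print(nan_counts)
```

On the real data this check has to run on the raw letters, before the encoding cell, since afterward `'?'` is just another integer.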

In [None]:
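#  Full encoded dataset, target as first column. This copy
#+ keeps the header row (to_csv writes column names by default).
#  What we, with xgboost, and SageMaker (with its xgboost)
#+ will need is the same layout _without_ column names;
#+ those files get written further down.
#  It seems sklearn likes to have the column names (?)
df.to_csv("mushroom_all_encoded.csv", index=False)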

Let's follow `iris_data_preparation` for making our training and validation sets.
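
The shuffle-then-slice idea can be sketched on row positions alone. This is a toy version under assumptions: ten pretend rows, and `np.random.RandomState(...).permutation` standing in for the seed-and-shuffle steps the notebook uses; the variable names are made up.

```python
import numpy as np

n_rows = 10                  # pretend dataset size
fraction_for_training = 0.7
rnd_seed = 5

# Shuffle the row positions reproducibly.
rng = np.random.RandomState(rnd_seed)
shuffled_positions = rng.permutation(n_rows)

# The first 70% of the shuffled positions train; the rest validate.
n_train = int(fraction_for_training * n_rows)
train_positions = shuffled_positions[:n_train]
validation_positions = shuffled_positions[n_train:]

print(len(train_positions), len(validation_positions))
```

Every row lands in exactly one of the two sets, and rerunning with the same seed gives the same split.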

### Training and Validation Set

### Target Variable as first column followed by input features:
mushroom_is_edible, cap-shape, cap-surface, cap-color, bruises, odor, gill-attachment, gill-spacing, gill-size, gill-color, stalk-shape, stalk-root, stalk-surface-above-ring, stalk-surface-below-ring, stalk-color-above-ring, stalk-color-below-ring, veil-type, veil-color, ring-number, ring-type, spore-print-color, population, habitat

### Training and Validation files do not have a column header

#### (when feeding into sklearn or SageMaker)
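
A minimal sketch of what "target first, no header" looks like in practice, using a made-up three-column frame and an in-memory string instead of a file. `toy_df` and `toy_columns` are hypothetical; `to_csv` and `read_csv` are used the same way as in the cells below.

```python
import io

import pandas as pd

toy_columns = ["mushroom_is_edible", "cap-shape", "odor"]
toy_df = pd.DataFrame({
    "cap-shape": [5, 5, 0],
    "odor": [6, 0, 3],
    "mushroom_is_edible": [1, 0, 0],   # target is the last column here
})

# columns= fixes the order (target first); header=False drops the names.
csv_text = toy_df.to_csv(index=False, header=False, columns=toy_columns)
csv_lines = csv_text.splitlines()

# Reading it back needs the names supplied again.
toy_back = pd.read_csv(io.StringIO(csv_text), names=toy_columns)

print(csv_lines)
print(list(toy_back.columns))
```

That's why the column list gets saved to its own text file: the CSVs no longer carry the names.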

In [None]:
#  Training  =  70% of the data 
#+                 (of already-separated training; 
#+                  nothing from test set)
#  Validation = 30% of the data 
#+                 (of already-separated training;
#+                  nothing from test set)
#  We will randomize the order of the dataset entries

subj_prefix = 'mushroom'

fraction_for_training = 0.7
rnd_seed = 5

training_filename = subj_prefix + "_train.csv"
validation_filename = subj_prefix + "_validation.csv"
column_list_filename = subj_prefix + "_train_column_list.txt"

In [None]:
np.random.seed(rnd_seed)
l_shuffle = list(df.index)
np.random.shuffle(l_shuffle)
df = df.iloc[l_shuffle]

In [None]:
# numbers of entries (of rows) for each
rows = df.shape[0]
train = int(fraction_for_training * rows)
test = rows - train  # these rows become the validation set

In [None]:
rows, train, test

In [None]:
print(rows, train, test)

In [None]:
# Write Training Set
df[:train].to_csv(training_filename,
                  index=False, header=False,
                  columns=columns
                 )

In [None]:
# Write Validation Set
df[train:].to_csv(validation_filename,
                  index=False, header=False,
                  columns=columns
                 )

In [None]:
# Write Column List
with open(column_list_filename, 'w') as f:
    f.write(','.join(columns))
##endof:  with open ... as f

In [None]:
# Let's see what we have
!ls -lah

<br/>
<hr/>
<hr/>
<br/>

## Train a model with Mushroom data using XGBoost algorithm

### Doing the training in the same notebook as data prep

Though the reading in of files shows that it could be done separately.

### Model is trained with XGBoost, installed earlier in the notebook instance

In [None]:
column_list_file = "mushroom_train_column_list.txt"
train_file = "mushroom_train.csv"
validation_file = "mushroom_validation.csv"

In [None]:
columns = ""
with open(column_list_file, 'r') as tfh:
    columns = tfh.read().split(',')
##endof:  with open ... as tfh

In [None]:
columns

In [None]:
print(columns)

Actually, what's below isn't necessary at all in the mushroom dataset

\# **&lt;not-needed-for-mushrooms&gt;**

In [None]:
#labels=[0,1] # I'm almost positive this isn't necessary
classes = ['e', 'p']
le_2 = preprocessing.LabelEncoder()
le_2.fit(classes)

(From Chandra's notebook with the Iris dataset)


Doing all the tab_completions with le_2 I can.

In [None]:
le_2.classes_

In [None]:
le_2.get_params()

\# **&lt;/not-needed-for-mushrooms&gt;**

In [None]:
# Specify the column names as the file does not have a column header
df_train =      pd.read_csv(train_file,      names=columns)
df_validation = pd.read_csv(validation_file, names=columns)

In [None]:
print(df_train.head())

In [None]:
print(df_validation.head())

#### Here, we'll actually split up the dataframes as needed and train with the classifier

#### START: Optional inspection of how the dataframe gets split with iloc

In [None]:
print(df_train.head())

In [None]:
print(df_train.iloc[:,1:].head()) # All rows, columns 1 and on (zero-indexed)

In [None]:
print(df_train.iloc[:, 0].head()) #  All rows, column 0
                                  #+ The column header
                                  #+ is at the bottom.

In [None]:
df_train.iloc[:,0].to_numpy() # takes column zero and switches it to an array.

In [None]:
print(df_train.iloc[:,0].to_numpy()) # takes column zero and switches it to an array.

#### ENDOF: Optional inspection of how the dataframe gets split with iloc
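
The same split, boiled down to a made-up four-row frame (`toy_train` is hypothetical). One assumption to flag: this sketch uses `Series.to_numpy()` to get the 1-D label array, since newer pandas deprecates `Series.ravel()`.

```python
import pandas as pd

toy_train = pd.DataFrame({
    "mushroom_is_edible": [1, 0, 0, 1],
    "cap-shape": [5, 5, 0, 2],
    "odor": [6, 0, 3, 6],
})

X_toy = toy_train.iloc[:, 1:]            # every row, columns 1 onward
y_toy = toy_train.iloc[:, 0].to_numpy()  # every row, column 0, as an array

print(X_toy.shape, y_toy.shape)
```

The features stay a DataFrame (so the column names survive for things like the importance plot), while the target becomes a plain array.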

In [None]:
# .to_numpy() gives the same 1-D array; newer pandas deprecates Series.ravel()
X_train = df_train.iloc[:, 1:]
y_train = df_train.iloc[:, 0].to_numpy()

X_validation = df_validation.iloc[:, 1:]
y_validation = df_validation.iloc[:, 0].to_numpy()

In [None]:
# Launch a classifier
# XGBoost Training Parameter Reference:
#   https://github.com/dmlc/xgboost/blob/master/doc/parameter.md
#classifier = xgb.XGBClassifier(objective='binary:logistic',
#                               n_estimators=50)
classifier = xgb.XGBClassifier(objective='binary:logistic')

In [None]:
classifier

In [None]:
print(classifier)

In [None]:
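#  Note: XGBoost 1.6+ deprecates eval_metric in fit(), and later
#+ releases may reject it; there, pass eval_metric to the
#+ XGBClassifier constructor instead.
classifier.fit(X_train,
               y_train,
               eval_set=[(X_train, y_train), (X_validation, y_validation)],
               eval_metric=['logloss'])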

In [None]:
eval_result = classifier.evals_result()

In [None]:
training_rounds = range(len(eval_result['validation_0']['logloss']))

In [None]:
print(training_rounds)

In [None]:
plt.scatter(x=training_rounds,
            y=eval_result['validation_0']['logloss'],
            label="Training Error")
plt.scatter(x=training_rounds,
            y=eval_result['validation_1']['logloss'],
            label="Validation Error")
plt.grid(True)
plt.xlabel("Iterations")
plt.ylabel("LogLoss")
plt.title("Training Vs Validation Error")
plt.legend()
plt.show()
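
For reading that plot, it helps to see what logloss actually computes. Below is a hand-rolled sketch of mean binary cross-entropy in plain Python; `log_loss_by_hand` is a made-up helper, not XGBoost's implementation (library versions typically also clip probabilities away from 0 and 1).

```python
import math


def log_loss_by_hand(y_true, p_pred):
    """Mean binary cross-entropy; p_pred is the predicted P(class == 1)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


# Confident and right gives a small loss ...
confident_right = log_loss_by_hand([1, 0], [0.9, 0.1])
# ... confident and wrong gives a much larger one.
confident_wrong = log_loss_by_hand([1, 0], [0.1, 0.9])

print(round(confident_right, 4), round(confident_wrong, 4))
```

So a curve heading toward zero means the predicted probabilities are getting both correct and confident.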

In [None]:
xgb.plot_importance(classifier)
plt.show()

In [None]:
df = pd.read_csv(validation_file, names=columns)

In [None]:
print(df.head())

### Prediction Time

In [None]:
X_test = df.iloc[:, 1:]

In [None]:
result = classifier.predict(X_test)

In [None]:
# Let's look at a few of the predictions
result[:5]  # shoot, all five are poisonous

In [None]:
# Does the end look any better?
result[-5:]  # all right, some edible ones

In [None]:
df['predicted_class'] = result

In [None]:
print(df.head())

In [None]:
print(df.tail())

In [None]:
df.mushroom_is_edible.value_counts()

In [None]:
print(df['mushroom_is_edible'].value_counts())

In [None]:
df.predicted_class.value_counts()

In [None]:
print(df['predicted_class'].value_counts())

That looks uh pretty-pretty good.
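
One caveat, sketched on made-up arrays: matching value counts don't by themselves prove the rows agree, so a row-by-row comparison is a useful extra check. In the notebook, the equivalent would be something like `(df['mushroom_is_edible'] == df['predicted_class']).mean()`.

```python
import numpy as np

actual = np.array([1, 0, 0, 1, 0, 1])      # made-up true labels
predicted = np.array([1, 0, 1, 0, 0, 1])   # made-up predictions

# Same totals per class ...
same_counts = np.array_equal(np.bincount(actual), np.bincount(predicted))
# ... but two of the six rows are wrong.
row_accuracy = (actual == predicted).mean()

print(same_counts, row_accuracy)
```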

## Binary Classifier Metrics

I'm just following the patterns from the solution, rather than an earlier notebook.
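
Before the helper functions, here's a tiny worked example of the counts they pull out, done by hand in plain Python on made-up labels. It assumes the same convention as the cells below: 1 (poisonous) is the positive class, and the matrix is laid out like `confusion_matrix(..., labels=[1, 0])`, with true labels as rows and predictions as columns.

```python
y_true = [1, 1, 1, 0, 0, 0, 0, 1]   # made-up labels, 1 = positive (poisonous)
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]   # made-up predictions

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)
fn = sum(1 for t, p in pairs if t == 1 and p == 0)
fp = sum(1 for t, p in pairs if t == 0 and p == 1)
tn = sum(1 for t, p in pairs if t == 0 and p == 0)

# Rows are true labels [1, 0]; columns are predicted labels [1, 0].
matrix = [[tp, fn],
          [fp, tn]]

recall = tp / (tp + fn)
precision = tp / (tp + fp)
f1 = 2 * precision * recall / (precision + recall)

print(matrix)
print(recall, precision, f1)
```

With poisonous as the positive class, a false negative is a poisonous mushroom called edible, which is the mistake that matters most here.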

In [None]:
# Reference: https://scikit-learn.org/stable/modules/model_evaluation.html
# Explicitly stating labels. Poisonous=1 (positive class), Edible=0
def true_positive(y_true, y_pred):
    return confusion_matrix(y_true, y_pred, labels=[1, 0])[0, 0]
                         # positions in  confusion matrix -^--^-
##endof:  true_positive(y_true, y_pred)

def true_negative(y_true, y_pred):
    return confusion_matrix(y_true, y_pred, labels=[1, 0])[1, 1]
##endof:  true_negative(y_true, y_pred)

def false_positive(y_true, y_pred):
    return confusion_matrix(y_true, y_pred, labels=[1, 0])[1, 0]
##endof:  false_positive(y_true, y_pred)

def false_negative(y_true, y_pred):
    return confusion_matrix(y_true, y_pred, labels=[1, 0])[0, 1]
##endof:  false_negative(y_true, y_pred)

In [None]:
# Compute Binary Classifier Metrics
# Returns a dictionary {"MetricName":Value,...}

def binary_classifier_metrics(y_true, y_pred):
    metrics = {}

    # References: 
    #  https://docs.aws.amazon.com/machine-learning/latest/dg/binary-classification.html
    #  https://en.wikipedia.org/wiki/Confusion_matrix
    
    # Definition:
    # true positive = tp = how many samples were correctly classified as positive (count)
    # true negative = tn = how many samples were correctly classified as negative (count)
    # false positive = fp = how many negative samples were mis-classified as positive (count)
    # false_negative = fn = how many positive samples were mis-classified as negative (count)
    
    # positive = number of positive samples (count)
    #          = true positive + false negative
    # negative = number of negative samples (count)
    #          = true negative + false positive
    
    tp = true_positive(y_true, y_pred)
    tn = true_negative(y_true, y_pred)
    fp = false_positive(y_true, y_pred)
    fn = false_negative(y_true, y_pred)
    
    positive = tp + fn
    negative = tn + fp
    
    metrics['TruePositive'] = tp
    metrics['TrueNegative'] = tn
    metrics['FalsePositive'] = fp
    metrics['FalseNegative'] = fn
    
    metrics['Positive'] = positive
    metrics['Negative'] = negative
    
    # True Positive Rate (TPR, Recall) = true positive/positive
    # How many positives were correctly classified? (fraction)
    # Recall value closer to 1 is better. closer to 0 is worse
    if tp == 0:
        recall = 0
    else:
        recall = tp/positive
        
    metrics['Recall'] = recall
    
    # True Negative Rate = True Negative/negative
    # How many negatives were correctly classified? (fraction)
    # True Negative Rate value closer to 1 is better. closer to 0 is worse
    if tn == 0:
        tnr = 0
    else:
        tnr = tn/(negative)
    metrics['TrueNegativeRate'] = tnr
    
    # Precision = True Positive/(True Positive + False Positive)
    # How many positives classified by the algorithm are really positives? (fraction)
    # Precision value closer to 1 is better. closer to 0 is worse
    if tp == 0:
        precision = 0
    else:
        precision = tp/(tp + fp)
    metrics['Precision'] = precision
    
    # Accuracy = (True Positive + True Negative)/(total positive + total negative)
    # How many positives and negatives were correctly classified? (fraction)
    # Accuracy value closer to 1 is better. closer to 0 is worse
    accuracy = (tp + tn)/(positive + negative)
    metrics['Accuracy'] = accuracy
    
    # False Positive Rate (FPR, False Alarm) = False Positive/(total negative)
    # How many negatives were mis-classified as positives (fraction)
    # False Positive Rate value closer to 0 is better. closer to 1 is worse
    if fp == 0:
        fpr = 0
    else:
        fpr = fp/(negative)
    metrics['FalsePositiveRate'] = fpr
    
    # False Negative Rate (FNR, Misses) = False Negative/(total Positive)
    # How many positives were mis-classified as negative (fraction)
    # False Negative Rate value closer to 0 is better. closer to 1 is worse
    # Guarded like the other rates, so no positives doesn't divide by zero
    if fn == 0:
        fnr = 0
    else:
        fnr = fn/(positive)
    metrics['FalseNegativeRate'] = fnr
    
    # F1 Score = harmonic mean of Precision and Recall
    # F1 Score closer to 1 is better. Closer to 0 is worse.
    if precision == 0 or recall == 0:
        f1 = 0
    else:        
        f1 = 2*precision*recall/(precision+recall)

    metrics['F1'] = f1
    
    return metrics

In [None]:
# Reference: 
# https://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        #print("Normalized confusion matrix")
    #else:
    #    print('Confusion matrix, without normalization')

    #print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.ylabel("True label")
    plt.xlabel("Predicted label")
    plt.tight_layout()

In [None]:
# Compute confusion matrix
# LabelEncoder gave {0: 'edible', 1: 'poisonous'}  <-- WHAT!?!?!
# So, despite its name, mushroom_is_edible == 1 means poisonous.
cnf_matrix = confusion_matrix(df['mushroom_is_edible'], 
                              df['predicted_class'],
                              labels=[1, 0])

In [None]:
# Plot confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=["Poisonous", "Edible"],
                      title="Confusion Matrix")

In [None]:
# Plot confusion matrix - fractions
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=["Poisonous", "Edible"],
                      title="Confusion Matrix",
                      normalize=True)

In [None]:
metrics = [binary_classifier_metrics(df['mushroom_is_edible'], df['predicted_class'])]
df_metrics = pd.DataFrame(metrics)  # list with one dict -> one-row DataFrame
df_metrics.index = ['Model']

In [None]:
df_metrics

In [None]:
print(df_metrics)

In [None]:
print('Counts')
print(df_metrics[['TruePositive',
                  'FalseNegative',
                  'FalsePositive',
                  'TrueNegative',]].round(2))
print()
print('Fractions')
print(df_metrics[['Recall',
                  'FalseNegativeRate',
                  'FalsePositiveRate',
                  'TrueNegativeRate',]].round(2))
print()

print(df_metrics[['Precision',
                  'Accuracy',
                  'F1']].round(2))

In [None]:
print(classification_report(df['mushroom_is_edible'],
                            df['predicted_class'],
                            labels=[1, 0],
                            target_names=['Poisonous','Edible']
                           )
     )