## Mushroom Classifier

The stakes are higher than in hot dog/not hot dog: this notebook classifies a mushroom as poisonous or edible. The data set is available [here](https://archive.ics.uci.edu/ml/datasets/Mushroom).

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
# Column names for the headerless data file: the label first, then the 22 attributes
columns = ['class', 'cap-shape', 'cap-surface', 'cap-color', 'bruises', 'odor',
           'gill-attachment', 'gill-spacing', 'gill-size', 'gill-color', 'stalk-shape', 'stalk-root',
           'stalk-surface-above-ring', 'stalk-surface-below-ring', 'stalk-color-above-ring', 'stalk-color-below-ring',
           'veil-type', 'veil-color', 'ring-number', 'ring-type', 'spore-print-color',
           'population', 'habitat']

In [None]:
df = pd.read_csv('agaricus-lepiota.data', names=columns)

In [None]:
df.head()

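Hmmm... we want to use XGBoost, but it needs numerical values, and every column here is a categorical variable stored as a single-letter code.

We can use scikit-learn's [LabelEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html?highlight=labelencoder#sklearn.preprocessing.LabelEncoder), keeping one encoder per column so that the mapping can be inspected later.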
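As a minimal sketch of what `LabelEncoder` does to a single column, here is a toy example. The `odor` list is made up for illustration and is not read from the data set.

```python
from sklearn.preprocessing import LabelEncoder

# Toy column of single-letter codes, in the style of the mushroom data (illustrative values).
odor = ['p', 'a', 'l', 'a', 'n']

encoder = LabelEncoder()
codes = encoder.fit_transform(odor)

print(encoder.classes_)                  # sorted unique labels
print(codes)                             # one integer code per row
print(encoder.inverse_transform(codes))  # back to the original letters
```

The integers follow the sorted order of the labels, so they carry no real meaning beyond identity. The scikit-learn documentation describes `LabelEncoder` as intended for target labels and offers `OrdinalEncoder` for feature columns; the per-column approach used in this notebook gives a comparable result.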

In [None]:
from sklearn import preprocessing
from collections import defaultdict
# One LabelEncoder per column, created the first time a column name is looked up
dd = defaultdict(preprocessing.LabelEncoder)

In [None]:
df = df.apply(lambda x: dd[x.name].fit_transform(x))

In [None]:
df.head()

What's the breakdown between the two classes?

In [None]:
df['class'].value_counts()

In [None]:
# Show which original labels each column's integer codes stand for
for key, encoder in dd.items():
    print(key, encoder.classes_)
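Because each column keeps its own fitted encoder, the encoding can also be reversed. The sketch below shows this on a toy frame; `toy`, `encoders`, and the values in it are hypothetical stand-ins for `df` and `dd`.

```python
from collections import defaultdict

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame standing in for the mushroom data (values are illustrative).
toy = pd.DataFrame({'class': ['p', 'e', 'e'], 'odor': ['p', 'a', 'l']})

encoders = defaultdict(LabelEncoder)
encoded = toy.apply(lambda col: encoders[col.name].fit_transform(col))

# Reverse the encoding column by column with the stored encoders.
decoded = encoded.apply(lambda col: encoders[col.name].inverse_transform(col))

# Look up which integer each class label received.
class_encoder = encoders['class']
class_codes = {label: int(code) for label, code
               in zip(class_encoder.classes_,
                      class_encoder.transform(class_encoder.classes_))}
print(class_codes)
```

A lookup like `class_codes` is handy later for telling which integer means poisonous and which means edible.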

Shuffle the rows, then split the data into training (70%) and validation (30%) sets.

In [None]:
np.random.seed(5)  # fixed seed so the shuffle is reproducible
indices = list(df.index)
np.random.shuffle(indices)
df = df.iloc[indices]

In [None]:
rows = df.shape[0]
train = int(.7 * rows)  # number of rows in the training set
test = rows - train     # remaining rows form the validation set
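An alternative to shuffling and slicing by hand is scikit-learn's `train_test_split`, which can also stratify on the label so both sets keep the same class proportions. This is only a sketch on a made-up frame, not what the notebook itself does; `toy`, `train_df`, and `valid_df` are hypothetical names.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy encoded frame: 20 rows with a balanced 0/1 class column (illustrative only).
toy = pd.DataFrame({'class': [0, 1] * 10, 'odor': range(20)})

train_df, valid_df = train_test_split(
    toy,
    train_size=0.7,         # same 70/30 proportion as the manual split
    random_state=5,         # fixed seed for reproducibility
    stratify=toy['class'],  # keep class proportions equal in both sets
)

print(train_df.shape, valid_df.shape)
print(train_df['class'].value_counts())
```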

In [None]:
# Write Training Set (no index column, no header row)
df[:train].to_csv('mushroom_train.csv',
                  index=False, header=False,
                  columns=columns)

In [None]:
# Write Validation Set (no index column, no header row)
df[train:].to_csv('mushroom_validation.csv',
                  index=False, header=False,
                  columns=columns)

In [None]:
# Write Column List
with open('mushroom_train_column_list.txt','w') as f:
    f.write(','.join(columns))
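Since the CSV files carry no header row, the column list file is what lets a reader restore the names. The round trip can be sketched as follows, using a temporary directory and a made-up three-column frame; the `toy_*` file names are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Hypothetical miniature of the files written above.
columns = ['class', 'odor', 'habitat']
toy = pd.DataFrame([[1, 6, 5], [0, 0, 1]], columns=columns)

with tempfile.TemporaryDirectory() as tmp:
    data_path = os.path.join(tmp, 'toy_train.csv')
    cols_path = os.path.join(tmp, 'toy_train_column_list.txt')

    # Write the data without header or index, and the names to a side file.
    toy.to_csv(data_path, index=False, header=False, columns=columns)
    with open(cols_path, 'w') as f:
        f.write(','.join(columns))

    # Reload: the column names come from the side file, not from the CSV.
    with open(cols_path) as f:
        names = f.read().strip().split(',')
    restored = pd.read_csv(data_path, names=names)

print(restored)
```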