# Classification Of Mushroom Data

[Dataset](https://archive.ics.uci.edu/ml/datasets/Mushroom)

## Dealing With Dependencies

In [1]:
!pip install -U pandas numpy pandas_profiling[notebook] sklearn catboost seaborn matplotlib

Collecting pandas
  Downloading pandas-1.3.5-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (11.3 MB)
Collecting numpy
  Downloading numpy-1.21.5-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (15.7 MB)
Collecting pandas_profiling[notebook]
  Downloading pandas_profiling-3.1.0-py2.py3-none-any.whl (261 kB)
Collecting sklearn
  Downloading sklearn-0.0.tar.gz (1.1 kB)
  Preparing metadata (setup.py) ... done
Collecting catboost
  Downloading catboost-1.0.4-cp37-none-manylinux1_x86_64.whl (76.1 MB)
Collecting seaborn
  Downloading seaborn-0.11.2-py3-none-any.whl (292 kB)
Collecting mat

## Dealing With Google Colab Issues

In [2]:
#from google.colab import output
#output.enable_custom_widget_manager()

In [3]:
import matplotlib
import matplotlib.pyplot as plt

In [4]:
%matplotlib notebook

## Data Acquisition

In [5]:
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data

--2022-01-19 08:06:49--  https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 373704 (365K) [application/x-httpd-php]
Saving to: ‘agaricus-lepiota.data’


2022-01-19 08:06:50 (1.41 MB/s) - ‘agaricus-lepiota.data’ saved [373704/373704]



In [6]:
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.names

--2022-01-19 08:06:50--  https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.names
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6816 (6.7K) [application/x-httpd-php]
Saving to: ‘agaricus-lepiota.names’


2022-01-19 08:06:51 (152 MB/s) - ‘agaricus-lepiota.names’ saved [6816/6816]



In [7]:
# Dataset Description

!cat agaricus-lepiota.names

1. Title: Mushroom Database

2. Sources: 
    (a) Mushroom records drawn from The Audubon Society Field Guide to North
        American Mushrooms (1981). G. H. Lincoff (Pres.), New York: Alfred
        A. Knopf
    (b) Donor: Jeff Schlimmer (Jeffrey.Schlimmer@a.gp.cs.cmu.edu)
    (c) Date: 27 April 1987

3. Past Usage:
    1. Schlimmer,J.S. (1987). Concept Acquisition Through Representational
       Adjustment (Technical Report 87-19).  Doctoral disseration, Department
       of Information and Computer Science, University of California, Irvine.
       --- STAGGER: asymptoted to 95% classification accuracy after reviewing
           1000 instances.
    2. Iba,W., Wogulis,J., & Langley,P. (1988).  Trading off Simplicity
       and Coverage in Incremental Concept Learning. In Proceedings of 
       the 5th International Conference on Machine Learning, 73-79.
       Ann Arbor, Michigan: Morgan Kaufmann.  
       -- approximately the same results with their HILLARY algorithm    
    3. In 

In [8]:
# Dataset preview

!head agaricus-lepiota.data

p,x,s,n,t,p,f,c,n,k,e,e,s,s,w,w,p,w,o,p,k,s,u
e,x,s,y,t,a,f,c,b,k,e,c,s,s,w,w,p,w,o,p,n,n,g
e,b,s,w,t,l,f,c,b,n,e,c,s,s,w,w,p,w,o,p,n,n,m
p,x,y,w,t,p,f,c,n,n,e,e,s,s,w,w,p,w,o,p,k,s,u
e,x,s,g,f,n,f,w,b,k,t,e,s,s,w,w,p,w,o,e,n,a,g
e,x,y,y,t,a,f,c,b,n,e,c,s,s,w,w,p,w,o,p,k,n,g
e,b,s,w,t,a,f,c,b,g,e,c,s,s,w,w,p,w,o,p,k,n,m
e,b,y,w,t,l,f,c,b,n,e,c,s,s,w,w,p,w,o,p,n,s,m
p,x,y,w,t,p,f,c,n,p,e,e,s,s,w,w,p,w,o,p,k,v,g
e,b,s,y,t,a,f,c,b,g,e,c,s,s,w,w,p,w,o,p,k,s,m


## Data Loading

In [9]:
import numpy as np
import pandas as pd
from pandas_profiling import ProfileReport

In [10]:
df = pd.read_csv('agaricus-lepiota.data', header=None)

In [11]:
colnames = ["class",
"cap-shape", 
"cap-surface", 
"cap-color", 
"bruises?",
"odor",
"gill-attachment",
"gill-spacing",
"gill-size",
"gill-color",
"stalk-shape",
"stalk-root",
"stalk-surface-above-ring",
"stalk-surface-below-ring",
"stalk-color-above-ring",
"stalk-color-below-ring",
"veil-type",
"veil-color",
"ring-number",
"ring-type",
"spore-print-color",
"population",
"habitat"
]

In [12]:
df.columns = colnames
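
As a rough illustration of how the raw rows line up with these column names, the sketch below parses the first preview row with the standard `csv` module instead of pandas. The `row_to_record` helper and the `COLNAMES` constant are hypothetical names introduced only for this example.

```python
import csv
import io

# Same 23 names as `colnames` above: the class label followed by 22 attributes.
COLNAMES = [
    "class", "cap-shape", "cap-surface", "cap-color", "bruises?", "odor",
    "gill-attachment", "gill-spacing", "gill-size", "gill-color",
    "stalk-shape", "stalk-root", "stalk-surface-above-ring",
    "stalk-surface-below-ring", "stalk-color-above-ring",
    "stalk-color-below-ring", "veil-type", "veil-color", "ring-number",
    "ring-type", "spore-print-color", "population", "habitat",
]


def row_to_record(line):
    """Pair each comma-separated code in one data row with its column name."""
    fields = next(csv.reader(io.StringIO(line)))
    if len(fields) != len(COLNAMES):
        raise ValueError(f"expected {len(COLNAMES)} fields, got {len(fields)}")
    return dict(zip(COLNAMES, fields))


# First row of the preview printed by `head` earlier in the notebook.
first_row = "p,x,s,n,t,p,f,c,n,k,e,e,s,s,w,w,p,w,o,p,k,s,u"
record = row_to_record(first_row)
print(record["class"], record["odor"], record["habitat"])
```

The length check guards against rows that are truncated or contain stray delimiters before they reach the model.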

In [13]:
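# '?' marks a missing stalk-root value; assign the result back instead of a chained inplace call
df['stalk-root'] = df['stalk-root'].replace('?', np.nan)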

In [14]:
df.head()

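| | class | cap-shape | cap-surface | cap-color | bruises? | odor | gill-attachment | gill-spacing | gill-size | gill-color | ... | stalk-surface-below-ring | stalk-color-above-ring | stalk-color-below-ring | veil-type | veil-color | ring-number | ring-type | spore-print-color | population | habitat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | p | x | s | n | t | p | f | c | n | k | ... | s | w | w | p | w | o | p | k | s | u |
| 1 | e | x | s | y | t | a | f | c | b | k | ... | s | w | w | p | w | o | p | n | n | g |
| 2 | e | b | s | w | t | l | f | c | b | n | ... | s | w | w | p | w | o | p | n | n | m |
| 3 | p | x | y | w | t | p | f | c | n | n | ... | s | w | w | p | w | o | p | k | s | u |
| 4 | e | x | s | g | f | n | f | w | b | k | ... | s | w | w | p | w | o | e | n | a | g |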


## Exploratory Data Analysis

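**Note:** No separate pre-processing step is needed: every feature is categorical, and CatBoost encodes categorical features itself. The only adjustment is made in the Data Formatting step, where the missing `stalk-root` values are replaced with a placeholder category, because CatBoost does not accept NaN in categorical features.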



In [15]:
profile = ProfileReport(df, title="Exploratory Data Analysis Report", explorative=True)

In [16]:
profile.to_notebook_iframe()


## Data Preparation

### Data Formatting

In [17]:
# Convert NaN values to a string category, since CatBoost rejects NaN in categorical features

df['stalk-root'] = df['stalk-root'].fillna('Nan')

### Data Splitting

In [18]:
from sklearn.model_selection import train_test_split

In [19]:
df_train_val, df_test = train_test_split(df, test_size=0.2, random_state=42)

In [20]:
df_train, df_val = train_test_split(df_train_val, test_size=0.125, random_state=42)

In [21]:
X_train, y_train = df_train.drop('class', axis=1), df_train['class']

In [22]:
X_test, y_test = df_test.drop('class', axis=1), df_test['class']

In [23]:
X_val, y_val = df_val.drop('class', axis=1), df_val['class']
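
The two `train_test_split` calls above give roughly a 70/10/20 train/validation/test split, because 12.5% of the remaining 80% is 10% of the full data. The sketch below reproduces that arithmetic with the standard library only. It assumes the 8124 rows of the UCI file and that a fractional `test_size` is rounded up to a whole number of rows; `split_sizes` is a hypothetical helper, not part of scikit-learn.

```python
import math


def split_sizes(n_rows, test_size, val_size):
    """Return (train, validation, test) row counts for two chained splits."""
    n_test = math.ceil(test_size * n_rows)     # first split: hold out the test set
    n_train_val = n_rows - n_test
    n_val = math.ceil(val_size * n_train_val)  # second split: hold out validation
    return n_train_val - n_val, n_val, n_test


n_rows = 8124  # assumed row count of agaricus-lepiota.data
n_train, n_val, n_test = split_sizes(n_rows, test_size=0.2, val_size=0.125)
print(n_train, n_val, n_test)
print([round(n / n_rows, 3) for n in (n_train, n_val, n_test)])
```

The resulting test count agrees with the total `support` of 1625 in the classification report in the Evaluation section.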

### Data Pool Creation

In [24]:
import catboost as cb

In [25]:
categorical_indicies = colnames[1:]

In [26]:
train_dataset = cb.Pool(X_train, y_train, 
                        cat_features=categorical_indicies)

In [27]:
val_dataset = cb.Pool(X_val, y_val, 
                        cat_features=categorical_indicies)

In [28]:
test_dataset = cb.Pool(X_test, y_test, 
                        cat_features=categorical_indicies)

## Training

In [29]:
model = cb.CatBoostClassifier(depth=8, 
                              iterations=100, 
                              eval_metric='Accuracy', 
                              boosting_type="Ordered", 
                              bagging_temperature=0, 
                              use_best_model=True, 
                              loss_function='Logloss')

In [30]:
model.fit(train_dataset, eval_set=val_dataset, plot=True)

Learning rate set to 0.13254
0:	learn: 0.9852269	test: 0.9827798	best: 0.9827798 (0)	total: 81.6ms	remaining: 8.08s
1:	learn: 0.9852269	test: 0.9827798	best: 0.9827798 (0)	total: 106ms	remaining: 5.17s
2:	learn: 0.9852269	test: 0.9827798	best: 0.9827798 (0)	total: 117ms	remaining: 3.79s
3:	learn: 0.9852269	test: 0.9827798	best: 0.9827798 (0)	total: 124ms	remaining: 2.96s
4:	learn: 0.9852269	test: 0.9827798	best: 0.9827798 (0)	total: 152ms	remaining: 2.89s
5:	learn: 0.9854027	test: 0.9827798	best: 0.9827798 (0)	total: 162ms	remaining: 2.54s
6:	learn: 0.9855786	test: 0.9827798	best: 0.9827798 (0)	total: 189ms	remaining: 2.51s
7:	learn: 0.9940204	test: 0.9926199	best: 0.9926199 (7)	total: 208ms	remaining: 2.4s
8:	learn: 0.9985930	test: 0.9987700	best: 0.9987700 (8)	total: 252ms	remaining: 2.55s
9:	learn: 0.9987689	test: 0.9987700	best: 0.9987700 (8)	total: 299ms	remaining: 2.69s
10:	learn: 0.9987689	test: 0.9987700	best: 0.9987700 (8)	total: 345ms	remaining: 2.79s
11:	learn: 1.0000000	tes

<catboost.core.CatBoostClassifier at 0x7fe4fb163e10>
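
With `use_best_model=True`, CatBoost keeps only the trees up to the iteration with the best validation metric. The `best: ... (k)` column in the log can be reproduced with a simple running maximum, sketched below on the validation accuracies of iterations 0 to 10 copied from the log above (the log is cut off after that, so later iterations are left out). `running_best` is a hypothetical helper, and keeping the earliest iteration on ties is inferred from the log rather than taken from the CatBoost documentation.

```python
def running_best(scores):
    """Yield (best_score, best_iteration) after each iteration; higher is better."""
    best_score, best_iter = float("-inf"), -1
    for i, score in enumerate(scores):
        if score > best_score:  # strict '>' keeps the earliest iteration on ties
            best_score, best_iter = score, i
        yield best_score, best_iter


# Validation ("test") accuracy for iterations 0..10, as printed in the log.
val_accuracy = [0.9827798] * 7 + [0.9926199, 0.9987700, 0.9987700, 0.9987700]
history = list(running_best(val_accuracy))
print(history[-1])
```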

## Evaluation

In [31]:
# Compute test-set metrics after each boosting iteration of the fitted model

model.eval_metrics(test_dataset, ['Accuracy', 'Logloss', 'AUC', 'CrossEntropy', 'Recall', 'Precision', 'F1', 'BalancedAccuracy'])

{'Accuracy': [0.9846153846153847,
  0.9846153846153847,
  0.9846153846153847,
  0.9846153846153847,
  0.9846153846153847,
  0.9846153846153847,
  0.9846153846153847,
  0.9932307692307693,
  0.9993846153846154,
  0.9993846153846154,
  0.9993846153846154,
  1.0],
 'Logloss': [0.4046603028737418,
  0.25480771267562063,
  0.1773531546348251,
  0.13900881249189365,
  0.08608024907331469,
  0.0561214522695024,
  0.03493003968495865,
  0.029374587056189828,
  0.020089262847260324,
  0.013632280477400146,
  0.009738713001664083,
  0.0072264113732767084],
 'AUC': [0.9896454326740753,
  0.9900701732031201,
  0.9900701732031201,
  0.9900701732031201,
  0.9989169116509361,
  0.9999241534769563,
  1.0,
  1.0,
  1.0,
  1.0,
  1.0,
  1.0],
 'CrossEntropy': [0.4046603028737418,
  0.25480771267562063,
  0.1773531546348251,
  0.13900881249189365,
  0.08608024907331469,
  0.0561214522695024,
  0.03493003968495865,
  0.029374587056189828,
  0.020089262847260324,
  0.013632280477400146,
  0.009738713001664
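
The `Logloss` and `CrossEntropy` values listed above are identical for this binary target. As a reminder of what that quantity measures, here is a minimal sketch of the standard binary log loss formula in plain Python. The labels and probabilities are made-up illustration values, not predictions from this model, and `binary_log_loss` is a hypothetical helper rather than a CatBoost function.

```python
import math


def binary_log_loss(y_true, p_pos, eps=1e-15):
    """Mean negative log-likelihood of 0/1 labels given P(label == 1)."""
    total = 0.0
    for y, p in zip(y_true, p_pos):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


y_true = [1, 0, 1, 0]
confident = [0.99, 0.01, 0.98, 0.02]  # hypothetical, well-separated probabilities
unsure = [0.60, 0.40, 0.55, 0.45]     # hypothetical, hesitant probabilities
print(binary_log_loss(y_true, confident))
print(binary_log_loss(y_true, unsure))
```

Both probability lists classify every example correctly at a 0.5 threshold, yet the confident one has a much lower loss. This is why the log loss above keeps falling across iterations even where accuracy stays flat.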

In [32]:
import seaborn as sns

In [33]:
def plot_feature_importance(importance, names, model_type):
    """Draw a horizontal bar chart of feature importances, largest first."""
    fi_df = pd.DataFrame({'feature_names': np.array(names),
                          'feature_importance': np.array(importance)})
    fi_df = fi_df.sort_values(by='feature_importance', ascending=False)

    plt.figure(figsize=(10, 8))
    sns.barplot(x=fi_df['feature_importance'], y=fi_df['feature_names'])
    plt.title(model_type + ' FEATURE IMPORTANCE')
    plt.xlabel('FEATURE IMPORTANCE')
    plt.ylabel('FEATURE NAMES')
    plt.show()

In [34]:
plot_feature_importance(model.get_feature_importance(), categorical_indicies, 'CATBOOST')

In [35]:
pred = model.predict(X_test)

In [36]:
pred_proba = model.predict_proba(X_test)

In [37]:
from sklearn.metrics import classification_report, confusion_matrix, roc_curve, RocCurveDisplay

In [38]:
print(classification_report(y_test, pred))

              precision    recall  f1-score   support

           e       1.00      1.00      1.00       843
           p       1.00      1.00      1.00       782

    accuracy                           1.00      1625
   macro avg       1.00      1.00      1.00      1625
weighted avg       1.00      1.00      1.00      1625
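
The report above can be cross-checked by hand. The sketch below derives precision, recall and F1 for one class from confusion-matrix counts. The first call uses the counts implied by the report for class `p` (782 test rows, all classified correctly, consistent with the final test accuracy of 1.0); the second uses hypothetical counts to show how errors move the scores. `prf` is a hypothetical helper.

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 for one class from its confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Class 'p' as implied by the report: 782 true positives and no errors.
print(prf(tp=782, fp=0, fn=0))

# Hypothetical counts: 20 poisonous mushrooms missed, 5 edible ones flagged.
print(prf(tp=762, fp=5, fn=20))
```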



In [39]:
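# Annotate each cell with its count; rows are true classes, columns are predictions
sns.heatmap(confusion_matrix(y_test, pred, labels=['e', 'p']), annot=True, fmt='d',
            xticklabels=['e', 'p'], yticklabels=['e', 'p'])
plt.show()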

In [40]:
RocCurveDisplay.from_predictions(y_test, pred_proba[:, 0], pos_label='e')
plt.show()
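
The area under the ROC curve has a handy interpretation: it is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counting one half. The sketch below computes it that way on made-up scores. `pairwise_auc` is a hypothetical helper, and its quadratic loop is meant for illustration, not for large datasets.

```python
def pairwise_auc(y_true, scores, pos_label):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == pos_label]
    neg = [s for y, s in zip(y_true, scores) if y != pos_label]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a correctly ranked pair
    return wins / (len(pos) * len(neg))


# Hypothetical labels and P(class == 'e') scores, not taken from the model.
labels = ['e', 'e', 'p', 'p', 'e', 'p']
separated = [0.9, 0.8, 0.1, 0.2, 0.7, 0.3]
overlapping = [0.9, 0.4, 0.1, 0.6, 0.7, 0.3]
print(pairwise_auc(labels, separated, pos_label='e'))
print(pairwise_auc(labels, overlapping, pos_label='e'))
```

An AUC of 1.0, as reported by `eval_metrics` for the later iterations, means every test row of one class is scored above every row of the other.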