# Classification Of Mushroom Data

[Dataset](https://archive.ics.uci.edu/ml/datasets/Mushroom)

## Dealing With Dependencies

In [1]:
!pip install -U pandas numpy pandas_profiling[notebook] scikit-learn catboost seaborn matplotlib ipywidgets



## Dealing With Google Colab Issues

In [2]:
#from google.colab import output
#output.enable_custom_widget_manager()

In [3]:
!jupyter nbextension enable --py widgetsnbextension

Enabling notebook extension jupyter-js-widgets/extension...
      - Validating: OK


In [4]:
import matplotlib
import matplotlib.pyplot as plt

In [5]:
%matplotlib inline

## Data Acquisition

In [6]:
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data

--2022-01-19 08:24:58--  https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 373704 (365K) [application/x-httpd-php]
Saving to: ‘agaricus-lepiota.data.3’


2022-01-19 08:24:58 (1.39 MB/s) - ‘agaricus-lepiota.data.3’ saved [373704/373704]



In [7]:
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.names

--2022-01-19 08:24:59--  https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.names
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6816 (6.7K) [application/x-httpd-php]
Saving to: ‘agaricus-lepiota.names.3’


2022-01-19 08:24:59 (154 MB/s) - ‘agaricus-lepiota.names.3’ saved [6816/6816]



In [8]:
# Dataset Description

!cat agaricus-lepiota.names

1. Title: Mushroom Database

2. Sources: 
    (a) Mushroom records drawn from The Audubon Society Field Guide to North
        American Mushrooms (1981). G. H. Lincoff (Pres.), New York: Alfred
        A. Knopf
    (b) Donor: Jeff Schlimmer (Jeffrey.Schlimmer@a.gp.cs.cmu.edu)
    (c) Date: 27 April 1987

3. Past Usage:
    1. Schlimmer,J.S. (1987). Concept Acquisition Through Representational
       Adjustment (Technical Report 87-19).  Doctoral disseration, Department
       of Information and Computer Science, University of California, Irvine.
       --- STAGGER: asymptoted to 95% classification accuracy after reviewing
           1000 instances.
    2. Iba,W., Wogulis,J., & Langley,P. (1988).  Trading off Simplicity
       and Coverage in Incremental Concept Learning. In Proceedings of 
       the 5th International Conference on Machine Learning, 73-79.
       Ann Arbor, Michigan: Morgan Kaufmann.  
       -- approximately the same results with their HILLARY algorithm    
    3. In 

In [9]:
# Dataset Preview

!head agaricus-lepiota.data

p,x,s,n,t,p,f,c,n,k,e,e,s,s,w,w,p,w,o,p,k,s,u
e,x,s,y,t,a,f,c,b,k,e,c,s,s,w,w,p,w,o,p,n,n,g
e,b,s,w,t,l,f,c,b,n,e,c,s,s,w,w,p,w,o,p,n,n,m
p,x,y,w,t,p,f,c,n,n,e,e,s,s,w,w,p,w,o,p,k,s,u
e,x,s,g,f,n,f,w,b,k,t,e,s,s,w,w,p,w,o,e,n,a,g
e,x,y,y,t,a,f,c,b,n,e,c,s,s,w,w,p,w,o,p,k,n,g
e,b,s,w,t,a,f,c,b,g,e,c,s,s,w,w,p,w,o,p,k,n,m
e,b,y,w,t,l,f,c,b,n,e,c,s,s,w,w,p,w,o,p,n,s,m
p,x,y,w,t,p,f,c,n,p,e,e,s,s,w,w,p,w,o,p,k,v,g
e,b,s,y,t,a,f,c,b,g,e,c,s,s,w,w,p,w,o,p,k,s,m


## Data Loading

In [10]:
import numpy as np
import pandas as pd
from pandas_profiling import ProfileReport

In [11]:
df = pd.read_csv('agaricus-lepiota.data', header=None)

In [12]:
colnames = ["class",
"cap-shape", 
"cap-surface", 
"cap-color", 
"bruises?",
"odor",
"gill-attachment",
"gill-spacing",
"gill-size",
"gill-color",
"stalk-shape",
"stalk-root",
"stalk-surface-above-ring",
"stalk-surface-below-ring",
"stalk-color-above-ring",
"stalk-color-below-ring",
"veil-type",
"veil-color",
"ring-number",
"ring-type",
"spore-print-color",
"population",
"habitat"
]

In [13]:
df.columns = colnames

In [14]:
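# '?' marks a missing value in this dataset; assign the result back instead of
# relying on an in-place replace on a column selection
df['stalk-root'] = df['stalk-root'].replace('?', np.nan)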
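
As an alternative to replacing `?` after loading, the marker could be declared as missing at read time. The sketch below is only an illustration: it uses a made-up two-row sample with three of the columns rather than the real file, and `df_sample` and `missing_per_column` are hypothetical names. Note that `na_values="?"` applies to every column, not just `stalk-root`.

```python
import io

import pandas as pd

# Made-up two-row sample standing in for agaricus-lepiota.data (illustration only).
sample = io.StringIO("p,x,?\ne,b,c\n")

# na_values="?" tells read_csv to treat "?" as missing in every column.
df_sample = pd.read_csv(
    sample,
    header=None,
    names=["class", "cap-shape", "stalk-root"],
    na_values="?",
)

# Count the missing entries per column to confirm the marker was converted.
missing_per_column = df_sample.isna().sum()
print(missing_per_column)
```

Counting missing values right after loading is a cheap way to confirm that only the expected column is affected.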

In [15]:
df.head()

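|   | class | cap-shape | cap-surface | cap-color | bruises? | odor | gill-attachment | gill-spacing | gill-size | gill-color | ... | stalk-surface-below-ring | stalk-color-above-ring | stalk-color-below-ring | veil-type | veil-color | ring-number | ring-type | spore-print-color | population | habitat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | p | x | s | n | t | p | f | c | n | k | ... | s | w | w | p | w | o | p | k | s | u |
| 1 | e | x | s | y | t | a | f | c | b | k | ... | s | w | w | p | w | o | p | n | n | g |
| 2 | e | b | s | w | t | l | f | c | b | n | ... | s | w | w | p | w | o | p | n | n | m |
| 3 | p | x | y | w | t | p | f | c | n | n | ... | s | w | w | p | w | o | p | k | s | u |
| 4 | e | x | s | g | f | n | f | w | b | k | ... | s | w | w | p | w | o | e | n | a | g |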


## Exploratory Data Analysis

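**Note:** No encoding step is needed because every feature is categorical and CatBoost handles categorical features natively. The only pre-processing is for the missing values in `stalk-root`: CatBoost does not accept NaN in a categorical feature, so they are converted to a string category in the Data Formatting step.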



In [16]:
profile = ProfileReport(df, title="Exploratory Data Analysis Report", explorative=True)

In [None]:
profile.to_notebook_iframe()


## Data Preparation

### Data Formatting

In [None]:
# Convert NaN to a string category, because CatBoost does not accept NaN in categorical features

df['stalk-root'] = df['stalk-root'].fillna('Nan')

### Data Splitting

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
# Hold out 20% of the rows as the test set
df_train_val, df_test = train_test_split(df, test_size=0.2, random_state=42)

In [None]:
# 12.5% of the remaining 80% is about 10% of all rows, giving a 70/10/20 split
df_train, df_val = train_test_split(df_train_val, test_size=0.125, random_state=42)

In [None]:
X_train, y_train = df_train.drop('class', axis=1), df_train['class']

In [None]:
X_test, y_test = df_test.drop('class', axis=1), df_test['class']

In [None]:
X_val, y_val = df_val.drop('class', axis=1), df_val['class']
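
The two chained splits above (`test_size=0.2`, then `test_size=0.125` on the remainder) are meant to give roughly a 70/10/20 train/validation/test split. The arithmetic can be checked with a minimal sketch. It assumes the 8,124 rows listed for this dataset and assumes that a fractional `test_size` is rounded up to a whole number of rows, which is how scikit-learn's `train_test_split` is understood to behave; `split_sizes` is a hypothetical helper, not part of any library.

```python
import math


def split_sizes(n_rows, test_size, val_size):
    """Return (train, val, test) row counts for two chained splits.

    Assumption: the held-out part of each split is rounded up,
    mirroring scikit-learn's handling of a float test_size.
    """
    n_test = math.ceil(test_size * n_rows)
    n_train_val = n_rows - n_test
    n_val = math.ceil(val_size * n_train_val)
    n_train = n_train_val - n_val
    return n_train, n_val, n_test


n_rows = 8124  # assumed row count of agaricus-lepiota.data
n_train, n_val, n_test = split_sizes(n_rows, test_size=0.2, val_size=0.125)

# Share of the full dataset that ends up in each part.
fractions = [round(n / n_rows, 3) for n in (n_train, n_val, n_test)]
print(n_train, n_val, n_test, fractions)
```

Comparing these counts with `len(df_train)`, `len(df_val)` and `len(df_test)` is a quick sanity check on the split.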

### Data Pool Creation

In [None]:
import catboost as cb

In [None]:
# Every feature column is categorical; CatBoost accepts column names here as well as indices
categorical_indicies = colnames[1:]

In [None]:
train_dataset = cb.Pool(X_train, y_train, 
                        cat_features=categorical_indicies)

In [None]:
val_dataset = cb.Pool(X_val, y_val, 
                        cat_features=categorical_indicies)

In [None]:
test_dataset = cb.Pool(X_test, y_test, 
                        cat_features=categorical_indicies)

## Training

In [None]:
model = cb.CatBoostClassifier(depth=8, 
                              iterations=100, 
                              eval_metric='Accuracy', 
                              boosting_type="Ordered", 
                              bagging_temperature=0, 
                              use_best_model=True, 
                              loss_function='Logloss')

In [None]:
model.fit(train_dataset, eval_set=val_dataset, plot=True)

## Evaluation

In [None]:
# Compute the listed metrics on the test set for each boosting iteration

model.eval_metrics(test_dataset, ['Accuracy', 'Logloss', 'AUC', 'CrossEntropy', 'Recall', 'Precision', 'F1', 'BalancedAccuracy'])

In [None]:
import seaborn as sns

In [None]:
def plot_feature_importance(importance, names, model_type):
    """Draw a horizontal bar chart of feature importances, largest first."""
    fi_df = pd.DataFrame({'feature_names': np.array(names),
                          'feature_importance': np.array(importance)})
    fi_df = fi_df.sort_values(by='feature_importance', ascending=False)
    plt.figure(figsize=(10, 8))
    sns.barplot(x=fi_df['feature_importance'],
                y=fi_df['feature_names'])
    plt.title(model_type + ' FEATURE IMPORTANCE')
    plt.xlabel('FEATURE IMPORTANCE')
    plt.ylabel('FEATURE NAMES')
    plt.show()

In [None]:
%matplotlib inline

In [None]:
plot_feature_importance(model.get_feature_importance(), categorical_indicies, 'CATBOOST')

In [None]:
pred = model.predict(X_test)

In [None]:
pred_proba = model.predict_proba(X_test)

In [None]:
from sklearn.metrics import classification_report, confusion_matrix, roc_curve, RocCurveDisplay

In [None]:
print(classification_report(y_test, pred))

In [None]:
# fmt='d' prints the counts as integers; rows are true labels, columns are predictions
sns.heatmap(confusion_matrix(y_test, pred), annot=True, fmt='d')
plt.show()
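
The figures in the classification report can be derived from the four cells of the confusion matrix. The sketch below shows that relationship for a binary problem. The counts are made up purely for illustration and are not results from this notebook, and `binary_metrics` is a hypothetical helper. With scikit-learn's layout (rows are true labels, columns are predictions, labels sorted) and `'e'` treated as the positive class, the cells would map to `tp = cm[0, 0]`, `fn = cm[0, 1]`, `fp = cm[1, 0]`, `tn = cm[1, 1]`.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)  # share of positive predictions that are right
    recall = tp / (tp + fn)     # share of actual positives that are found
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up counts for illustration only; not taken from the mushroom model.
metrics = binary_metrics(tp=90, fp=10, fn=30, tn=70)
print(metrics)
```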

In [None]:
RocCurveDisplay.from_predictions(y_test, pred_proba[:, 0], pos_label='e')
plt.show()
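
The ROC cell relies on column 0 of `pred_proba` holding the probability of class `'e'`, which is the case when the class labels are ordered as `['e', 'p']`. A small helper can make that dependency explicit by looking the label up instead of hard-coding the index. This is a sketch: `positive_class_column` is a hypothetical name, and the list below is a stand-in for the fitted model's `classes_` attribute.

```python
def positive_class_column(classes, pos_label):
    """Return the probability-column index that corresponds to pos_label."""
    classes = list(classes)
    if pos_label not in classes:
        raise ValueError(f"{pos_label!r} is not one of the classes {classes}")
    return classes.index(pos_label)


# Stand-in for model.classes_ (assumed ordering for this dataset).
example_classes = ["e", "p"]
col = positive_class_column(example_classes, "e")
print(col)
```

In the notebook this would be used as `pred_proba[:, positive_class_column(model.classes_, 'e')]`, so the plot stays correct even if the label order differs.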