Dataset Link: https://archive.ics.uci.edu/dataset/73/mushroom.

Running `pip install ucimlrepo` is necessary to access the data through the API.

# **0 - Imports and dataset description**

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
import plotly

import sklearn
import imblearn
import category_encoders
import xgboost as xgb

from ucimlrepo import fetch_ucirepo 

from category_encoders.target_encoder import TargetEncoder

from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler, OrdinalEncoder
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score, confusion_matrix, accuracy_score
from sklearn.decomposition import PCA

pd.set_option('display.max_columns', None)

### Columns description

1. **cap-shape**: bell=b, conical=c, convex=x, flat=f, knobbed=k, sunken=s
2. **cap-surface**: fibrous=f, grooves=g, scaly=y, smooth=s
3. **cap-color**: brown=n, buff=b, cinnamon=c, gray=g, green=r, pink=p, purple=u, red=e, white=w, yellow=y
4. **bruises?**: bruises=t, no=f
5. **odor**: almond=a, anise=l, creosote=c, fishy=y, foul=f, musty=m, none=n, pungent=p, spicy=s
6. **gill-attachment**: attached=a, descending=d, free=f, notched=n
7. **gill-spacing**: close=c, crowded=w, distant=d
8. **gill-size**: broad=b, narrow=n
9. **gill-color**: black=k, brown=n, buff=b, chocolate=h, gray=g, green=r, orange=o, pink=p, purple=u, red=e, white=w, yellow=y
10. **stalk-shape**: enlarging=e, tapering=t
11. **stalk-root**: bulbous=b, club=c, cup=u, equal=e, rhizomorphs=z, rooted=r, missing=?
12. **stalk-surface-above-ring**: fibrous=f, scaly=y, silky=k, smooth=s
13. **stalk-surface-below-ring**: fibrous=f, scaly=y, silky=k, smooth=s
14. **stalk-color-above-ring**: brown=n, buff=b, cinnamon=c, gray=g, orange=o, pink=p, red=e, white=w, yellow=y
15. **stalk-color-below-ring**: brown=n, buff=b, cinnamon=c, gray=g, orange=o, pink=p, red=e, white=w, yellow=y
16. **veil-type**: partial=p, universal=u
17. **veil-color**: brown=n, orange=o, white=w, yellow=y
18. **ring-number**: none=n, one=o, two=t
19. **ring-type**: cobwebby=c, evanescent=e, flaring=f, large=l, none=n, pendant=p, sheathing=s, zone=z
20. **spore-print-color**: black=k, brown=n, buff=b, chocolate=h, green=r, orange=o, purple=u, white=w, yellow=y
21. **population**: abundant=a, clustered=c, numerous=n, scattered=s, several=v, solitary=y
22. **habitat**: grasses=g, leaves=l, meadows=m, paths=p, urban=u, waste=w, woods=d

23. **Class Label**: edible=e, poisonous=p


Import the dataset directly from the UCI ML Repository API.

In [2]:
# fetch dataset
mushroom = fetch_ucirepo(id=73) 
  
# data (as pandas dataframes) 
X = mushroom.data.features 
y = mushroom.data.targets 
  
# merge features and target into a single DataFrame
mushroom = pd.concat([X, y], axis=1)
mushroom.head(5)

Unnamed: 0,cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment,gill-spacing,gill-size,gill-color,stalk-shape,stalk-root,stalk-surface-above-ring,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat,poisonous
0,x,s,n,t,p,f,c,n,k,e,e,s,s,w,w,p,w,o,p,k,s,u,p
1,x,s,y,t,a,f,c,b,k,e,c,s,s,w,w,p,w,o,p,n,n,g,e
2,b,s,w,t,l,f,c,b,n,e,c,s,s,w,w,p,w,o,p,n,n,m,e
3,x,y,w,t,p,f,c,n,n,e,e,s,s,w,w,p,w,o,p,k,s,u,p
4,x,s,g,f,n,f,w,b,k,t,e,s,s,w,w,p,w,o,e,n,a,g,e


In [3]:
mushroom.shape

(8124, 23)

# **1 - Data Pre-Processing**

### 1.1 - Basic Pre-Processing

Inspect the column data types.

In [4]:
mushroom.dtypes

cap-shape                   object
cap-surface                 object
cap-color                   object
bruises                     object
odor                        object
gill-attachment             object
gill-spacing                object
gill-size                   object
gill-color                  object
stalk-shape                 object
stalk-root                  object
stalk-surface-above-ring    object
stalk-surface-below-ring    object
stalk-color-above-ring      object
stalk-color-below-ring      object
veil-type                   object
veil-color                  object
ring-number                 object
ring-type                   object
spore-print-color           object
population                  object
habitat                     object
poisonous                   object
dtype: object

Check for null values.

In [5]:
mushroom.isnull().sum()

cap-shape                      0
cap-surface                    0
cap-color                      0
bruises                        0
odor                           0
gill-attachment                0
gill-spacing                   0
gill-size                      0
gill-color                     0
stalk-shape                    0
stalk-root                  2480
stalk-surface-above-ring       0
stalk-surface-below-ring       0
stalk-color-above-ring         0
stalk-color-below-ring         0
veil-type                      0
veil-color                     0
ring-number                    0
ring-type                      0
spore-print-color              0
population                     0
habitat                        0
poisonous                      0
dtype: int64

Only one column, `stalk-root`, contains null values. Do these missing values carry information? Yes: rows with a missing `stalk-root` are more likely to be poisonous. Since this may be useful to the model, one option is to replace `NaN` with an explicit `missing` category and let the encoder deal with it in the next stage.

In [6]:
# mushroom['poisonous'].loc[mushroom['stalk-root'].isna()].value_counts(normalize=True)

In [7]:
# mushroom['stalk-root'] = mushroom['stalk-root'].fillna('missing')
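
The two commented-out cells above can be sketched on a small toy frame. The values below are made up for illustration and are not taken from the mushroom dataset; only the column names match. The sketch first looks at the class distribution among rows with a missing `stalk-root`, then fills the gaps with an explicit `missing` category.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (illustrative values only).
toy = pd.DataFrame({
    "stalk-root": ["b", np.nan, "e", np.nan, "c", np.nan],
    "poisonous":  ["e", "p",    "e", "p",    "p", "e"],
})

# Class distribution among the rows where stalk-root is missing.
missing_mask = toy["stalk-root"].isna()
missing_dist = toy.loc[missing_mask, "poisonous"].value_counts(normalize=True)
print(missing_dist)

# Replace NaN with an explicit category. Assigning the result back
# avoids calling fillna(inplace=True) on a column selection.
toy["stalk-root"] = toy["stalk-root"].fillna("missing")
print(toy["stalk-root"].value_counts())
```

Running the first step on the real `mushroom` frame would show whether the missing rows lean towards one class strongly enough to justify keeping them as their own category.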

Check for duplicate values.

In [8]:
mushroom.duplicated().sum()

0

### 1.2 - Variables Encoding Overview

First of all, let's encode the `poisonous` label as follows:

$$
\text{poisonous} = \begin{cases}
1 & \text{if poisonous} \\
0 & \text{otherwise}
\end{cases}
$$

In [9]:
# map the class label to integers: poisonous -> 1, edible -> 0
mushroom['poisonous'] = mushroom['poisonous'].map({'p': 1, 'e': 0})

***

Since the other 22 columns are all of type `object` as well, we have to encode them according to some criterion.

Here we can make a distinction:
- if the column's domain is binary, we replace the existing categorical values with {0, 1};
- if the column's domain has more than two values, I propose label-style integer encoding (`OrdinalEncoder` for feature columns), with __[Target Encoding](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.TargetEncoder.html)__ as an alternative, but feel free to share your opinion.
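
As a minimal sketch of the distinction above, here is how the two cases could look on a toy frame (the rows are invented for illustration; only the column names come from the dataset). The binary column gets an explicit {0, 1} mapping, while the multi-valued column gets one integer code per category through scikit-learn's `OrdinalEncoder`.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy data, not taken from the real dataset.
toy = pd.DataFrame({
    "bruises": ["t", "f", "t", "f"],   # binary domain
    "odor":    ["a", "n", "p", "n"],   # more than two values
})

# Binary column: explicit mapping to {0, 1}.
toy["bruises_enc"] = toy["bruises"].map({"f": 0, "t": 1})

# Multi-valued column: one integer code per category.
# By default the categories are sorted, so a -> 0, n -> 1, p -> 2.
encoder = OrdinalEncoder()
toy["odor_enc"] = encoder.fit_transform(toy[["odor"]]).ravel()

print(toy)
print(encoder.categories_)
```

Note that integer codes impose an order the original categories do not have. Tree-based models are usually tolerant of this, but it is worth keeping in mind if a linear model is tried later.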

In [10]:
binary_columns = []
nonbinary_columns = []

for col in mushroom.columns:
    if col == 'poisonous':
        pass
    elif mushroom[col].nunique() == 2:
        binary_columns.append(col)
    else:
        nonbinary_columns.append(col)

len(binary_columns) + len(nonbinary_columns)

22

***

`ColumnTransformer` is given column positions below, so we fix the column order explicitly: the five binary columns first, followed by the non-binary ones.

In [11]:
columns = ['bruises', 'gill-attachment', 'gill-spacing', 'gill-size', 'stalk-shape',
           'cap-shape', 'cap-surface', 'cap-color', 'odor', 'gill-color', 'stalk-root',
           'stalk-surface-above-ring', 'stalk-surface-below-ring', 'stalk-color-above-ring',
           'stalk-color-below-ring', 'veil-type', 'veil-color', 'ring-number', 'ring-type',
           'spore-print-color', 'population', 'habitat']

Perform **train-test split**.

In [12]:
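# select the features in the order of `columns`, so that the binary
# columns sit at positions 0-4 as the ColumnTransformer below expects
X = mushroom[columns]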
y = mushroom['poisonous']

In [13]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)

***

The indices passed to `TargetEncoder` run from 0 to 4 and correspond to the five columns of `binary_columns`; the remaining indices (5 to 21) correspond to the `nonbinary_columns`.

In [14]:
ct = ColumnTransformer(
    transformers=[
        # positions 0-4: binary columns
        ('target_encoder', make_pipeline(TargetEncoder()), [0, 1, 2, 3, 4]),
        # positions 5-21: non-binary columns
        ('ordinal_encoder', make_pipeline(OrdinalEncoder()), list(range(5, 22))),
    ],
    remainder='passthrough'
)
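
Hard-coded positions only work when the feature frame has exactly the expected column order. As a hedged alternative, the sketch below selects columns by name on a toy frame (invented rows, and a deliberately "wrong" column order). To keep it self-contained it uses `OrdinalEncoder` for both groups, which is an assumption of this sketch and not the encoder choice made above.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder

# Toy frame whose binary columns are NOT in the first positions.
toy = pd.DataFrame({
    "cap-shape": ["x", "b", "x", "f"],
    "bruises":   ["t", "f", "t", "f"],
    "odor":      ["a", "n", "p", "n"],
    "gill-size": ["n", "b", "b", "n"],
})

# Split the columns by the size of their domain, as in the cell above.
binary_cols = [c for c in toy.columns if toy[c].nunique() == 2]
nonbinary_cols = [c for c in toy.columns if c not in binary_cols]

# Selecting by name keeps the transformer correct whatever the column order.
ct_by_name = ColumnTransformer(
    transformers=[
        ("binary", OrdinalEncoder(), binary_cols),
        ("nonbinary", OrdinalEncoder(), nonbinary_cols),
    ]
)
encoded = ct_by_name.fit_transform(toy)

# If positions are still needed, derive them instead of typing them by hand.
binary_idx = [toy.columns.get_loc(c) for c in binary_cols]
print(binary_cols, binary_idx, encoded.shape)
```

Name-based selection requires the input to be a pandas DataFrame, which is the case for `X_train` and `X_test` here.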

In [15]:
model = xgb.XGBClassifier(
    objective="binary:logistic",
    scale_pos_weight=10,   # weight applied to the positive (poisonous) class
    max_depth=24,
    n_estimators=180,
    learning_rate=0.1,
    reg_lambda=6,          # L2 regularization
    reg_alpha=1,           # L1 regularization
    base_score=0.2,
)

In [16]:
pipeline = Pipeline(steps=[("column transformer", ct),
                           ('model', model)])

In [17]:
pipeline.fit(X_train, y_train)
# scikit-learn metrics expect (y_true, y_pred) in that order
accuracy_score(y_test, pipeline.predict(X_test))

1.0
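
Accuracy alone does not say which kind of error a model makes, and here a poisonous mushroom predicted as edible is the costly case. The sketch below shows how `f1_score` and `confusion_matrix` (both already imported) could complement it. The labels are toy values chosen by hand, not predictions from the pipeline above.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Hand-made toy labels: 1 = poisonous, 0 = edible.
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 0, 1, 1, 1, 0, 1, 0]

acc = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

print(f"accuracy={acc:.2f}  f1={f1:.2f}")
print(f"false negatives (poisonous predicted edible): {fn}")
```

On the real data the same calls would take `y_test` and `pipeline.predict(X_test)` as arguments.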

# **2 - Exploratory Data Analysis**
