## **Assignment 3 (2024/2): ML1**
**Safe to eat or deadly poison?**



This homework is a classification task to identify whether a mushroom is edible or poisonous.

This dataset includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota family, drawn from The Audubon Society Field Guide to North American Mushrooms (1981).

Each species is identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended. This last class was combined with the poisonous one. The Guide clearly states that there is no simple rule for determining the edibility of a mushroom; there is no rule like "leaflets three, let it be" for poison oak and ivy.


Step 1. Load 'mushroom2020_dataset.csv' from the “Attachment” (note: this dataset has already been partially prepared).

Step 2. Drop rows where the target (label) variable is missing.

Step 3. Drop the following variables:
'id','gill-attachment', 'gill-spacing', 'gill-size','gill-color-rate', 'stalk-root', 'stalk-surface-above-ring', 'stalk-surface-below-ring', 'stalk-color-above-ring-rate','stalk-color-below-ring-rate','veil-color-rate','veil-type'

Step 4. Examine the number of rows, the number of columns, and whether any values are missing.

Step 5. Fill missing values with the mean for numeric variables and the mode for nominal variables.

Step 6. Convert the label variable e (edible) to 1 and p (poisonous) to 0, then check the class counts (class0 : class1).

Step 7. Convert the nominal variables to numeric using dummy coding with drop_first = True.

Step 8. Split the data into train and test sets with a 20% test size, stratified on the label, with seed = 2020.

Step 9. Create a Random Forest and tune it with GridSearchCV on the training data using 5-fold cross-validation and the following settings:
- 'criterion': ['gini','entropy']
- 'max_depth': [2,3]
- 'min_samples_leaf': [2,5]
- 'n_estimators': [100]
- 'random_state': 2020

Step 10. Predict on the test set and evaluate the predictions with classification_report.


**Complete class MushroomClassifier from given code template below.**

In [3]:
#import your other libraries here
import pandas as pd
# hint
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.impute import SimpleImputer
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
# from sklearn.model_selection import ...
# from sklearn.ensemble import ...


In [4]:
hw = pd.read_csv('mushroom2020_dataset.csv')
hw.head()

Unnamed: 0,id,label,cap-shape,cap-surface,bruises,odor,gill-attachment,gill-spacing,gill-size,stalk-shape,...,ring-number,ring-type,spore-print-color,population,habitat,cap-color-rate,gill-color-rate,veil-color-rate,stalk-color-above-ring-rate,stalk-color-below-ring-rate
0,1,p,x,s,t,p,f,c,n,e,...,o,p,k,s,u,1.0,3.0,1.0,1.0,1.0
1,2,e,x,s,t,a,f,c,b,e,...,o,p,n,n,g,2.0,3.0,1.0,1.0,1.0
2,3,e,b,s,t,l,f,c,b,e,...,o,p,n,n,m,3.0,1.0,1.0,1.0,1.0
3,4,p,x,y,t,p,f,c,n,e,...,o,p,k,s,u,3.0,1.0,1.0,1.0,1.0
4,5,e,x,s,f,n,f,w,b,t,...,o,e,n,a,g,4.0,3.0,1.0,1.0,1.0


In [5]:
# hw['gill-size'].isna().sum()
hw.dropna(subset=['label'], inplace=True)
hw.drop(columns=['id','gill-attachment', 'gill-spacing', 'gill-size','gill-color-rate', 'stalk-root', 'stalk-surface-above-ring', 'stalk-surface-below-ring', 'stalk-color-above-ring-rate','stalk-color-below-ring-rate','veil-color-rate','veil-type'], inplace=True)
hw.reset_index(inplace=True)
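
The `index` column that appears in the later outputs comes from `reset_index(inplace=True)`, which keeps the old row labels as a regular column. The sketch below walks through Steps 2 and 3 on a made-up frame (the values are hypothetical; only the column names follow the assignment) and shows what `drop=True` would change:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are invented for illustration.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "label": ["p", np.nan, "e", "e"],
    "odor": ["p", "a", np.nan, "n"],
})

# Keep only rows whose label is present; missing values elsewhere are left alone.
toy = toy.dropna(subset=["label"]).drop(columns=["id"])

# reset_index() keeps the old row labels as a new "index" column ...
kept = toy.reset_index()
# ... while drop=True discards them.
dropped = toy.reset_index(drop=True)

print(kept.columns.tolist())     # ['index', 'label', 'odor']
print(dropped.columns.tolist())  # ['label', 'odor']
```

Either form works here, as long as the extra `index` column is removed before modelling.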

In [6]:
num_imp = SimpleImputer(missing_values=np.nan, strategy='mean')
hw[['cap-color-rate']] = pd.DataFrame(num_imp.fit_transform(hw[['cap-color-rate']]))

cat_imp = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
columns = ['cap-shape', 'cap-surface', 'bruises', 'odor', 'stalk-shape', 'ring-number', 'ring-type', 'spore-print-color', 'population', 'habitat']
hw[columns] = pd.DataFrame(cat_imp.fit_transform(hw[columns]))
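
For reference, mean and mode imputation can also be written with plain pandas. The sketch below uses hypothetical toy values and should behave like the two `SimpleImputer` calls above; it is an alternative illustration, not the required solution:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value per column.
toy = pd.DataFrame({
    "cap-color-rate": [1.0, np.nan, 3.0, 2.0],
    "odor": ["n", "n", np.nan, "a"],
})

# Numeric column: replace NaN with the mean of the observed values (2.0 here).
toy["cap-color-rate"] = toy["cap-color-rate"].fillna(toy["cap-color-rate"].mean())

# Nominal column: replace NaN with the most frequent category ('n' here).
toy["odor"] = toy["odor"].fillna(toy["odor"].mode()[0])

print(toy)
```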

In [7]:
# hw["label"] = pd.DataFrame([1 if v == 'e' else 0 for v in hw["label"]])
hw["label"] = hw["label"].map({'e':1, 'p':0})
((hw["label"] == 0).sum(), (hw["label"] == 1).sum())

(np.int64(3660), np.int64(2104))
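
The same class0 : class1 count can be read from `value_counts`. A minimal sketch on a hypothetical five-element label series:

```python
import pandas as pd

# Hypothetical labels; 'e' = edible, 'p' = poisonous.
labels = pd.Series(["e", "p", "p", "e", "p"])

# Map the codes to integers; any value missing from the dict would become NaN.
encoded = labels.map({"e": 1, "p": 0})

# Count each class and order the result by class label (0 first, then 1).
counts = encoded.value_counts().sort_index()
class0, class1 = int(counts.loc[0]), int(counts.loc[1])

print(class0, class1)  # 3 2
```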

In [8]:
print(hw.shape)
hw.isnull().sum()

(5764, 13)


index                0
label                0
cap-shape            0
cap-surface          0
bruises              0
odor                 0
stalk-shape          0
ring-number          0
ring-type            0
spore-print-color    0
population           0
habitat              0
cap-color-rate       0
dtype: int64

In [9]:
hw

Unnamed: 0,index,label,cap-shape,cap-surface,bruises,odor,stalk-shape,ring-number,ring-type,spore-print-color,population,habitat,cap-color-rate
0,0,0,x,s,t,p,e,o,p,k,s,u,1.0
1,1,1,x,s,t,a,e,o,p,n,n,g,2.0
2,2,1,b,s,t,l,e,o,p,n,n,m,3.0
3,3,0,x,y,t,p,e,o,p,k,s,u,3.0
4,4,1,x,s,f,n,t,o,e,n,a,g,4.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
5759,5819,1,k,s,f,n,e,o,p,b,c,l,1.0
5760,5820,1,x,s,f,n,e,o,p,b,v,l,1.0
5761,5821,1,f,s,f,n,e,o,p,b,c,l,1.0
5762,5822,0,k,y,f,y,t,o,e,w,v,l,1.0


In [12]:
nominal_cols = ['cap-shape','cap-surface','bruises','odor','stalk-shape','ring-number','ring-type','spore-print-color','population','habitat']
dummy = pd.get_dummies(hw[nominal_cols], drop_first=True)
hw = pd.concat([hw, dummy], axis=1)
hw.drop(columns=nominal_cols, inplace=True)
hw.drop(columns=["index"], inplace=True)
hw.shape

(5764, 43)
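
With `drop_first=True`, a nominal column with k categories becomes k-1 indicator columns, because the first category (in sorted order) is implied when all the others are False. The shape above is consistent with this: 13 original columns, minus the 10 nominal ones and `index`, plus 41 dummy columns, gives 43. A small sketch on hypothetical values:

```python
import pandas as pd

# Hypothetical toy data: 'bruises' has 2 categories, 'cap-shape' has 3.
toy = pd.DataFrame({
    "bruises": ["t", "f", "t"],
    "cap-shape": ["x", "b", "k"],
})

# The first category of each column ('f' and 'b') is dropped.
dummies = pd.get_dummies(toy, drop_first=True)

print(dummies.columns.tolist())  # ['bruises_t', 'cap-shape_k', 'cap-shape_x']
```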

In [13]:
hw

Unnamed: 0,label,cap-color-rate,cap-shape_c,cap-shape_f,cap-shape_k,cap-shape_x,cap-surface_g,cap-surface_s,cap-surface_y,bruises_t,...,population_n,population_s,population_v,population_y,habitat_g,habitat_l,habitat_m,habitat_p,habitat_u,habitat_w
0,0,1.0,False,False,False,True,False,True,False,True,...,False,True,False,False,False,False,False,False,True,False
1,1,2.0,False,False,False,True,False,True,False,True,...,True,False,False,False,True,False,False,False,False,False
2,1,3.0,False,False,False,False,False,True,False,True,...,True,False,False,False,False,False,True,False,False,False
3,0,3.0,False,False,False,True,False,False,True,True,...,False,True,False,False,False,False,False,False,True,False
4,1,4.0,False,False,False,True,False,True,False,False,...,False,False,False,False,True,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5759,1,1.0,False,False,True,False,False,True,False,False,...,False,False,False,False,False,True,False,False,False,False
5760,1,1.0,False,False,False,True,False,True,False,False,...,False,False,True,False,False,True,False,False,False,False
5761,1,1.0,False,True,False,False,False,True,False,False,...,False,False,False,False,False,True,False,False,False,False
5762,0,1.0,False,False,True,False,False,False,True,False,...,False,False,True,False,False,True,False,False,False,False


In [14]:
y = hw.pop('label')
X = hw
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=2020)
(X_train.shape, X_test.shape)

((4611, 42), (1153, 42))
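
With `stratify=y`, both splits keep roughly the class proportions of the full data, which is why the test-set supports reported later (732 and 421) have about the same ratio as 3660 : 2104. The sketch below illustrates this on synthetic labels with a 60/40 class balance (the data is made up for the example):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 samples, 60 of class 0 and 40 of class 1.
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 60 + [1] * 40)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=2020
)

# The 20-sample test set keeps the 60/40 balance: 12 of class 0, 8 of class 1.
print(np.bincount(y_test))   # [12  8]
print(np.bincount(y_train))  # [48 32]
```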

In [15]:
rf_clf = RandomForestClassifier(random_state=2020)
param_grid = {
    'criterion': ['gini','entropy'],
    'max_depth': [2,3],
    'min_samples_leaf': [2,5],
    'n_estimators': [100],
}

grid_search = GridSearchCV(estimator=rf_clf, param_grid=param_grid, cv=5)
grid_search.fit(X_train, y_train)
params = grid_search.best_params_
(params['criterion'], params['max_depth'], params['min_samples_leaf'], params['n_estimators'])
# params

('gini', 3, 5, 100)

In [16]:
params

{'criterion': 'gini',
 'max_depth': 3,
 'min_samples_leaf': 5,
 'n_estimators': 100}
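
Note that `random_state` is set on the estimator rather than in the grid, so it is fixed and not searched. To see how much work the search does, the grid can be expanded with scikit-learn's `ParameterGrid`. This is only a sketch for counting candidates; no model is fitted:

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [2, 3],
    "min_samples_leaf": [2, 5],
    "n_estimators": [100],
}

# Every combination of the listed values is one candidate: 2 * 2 * 2 * 1 = 8.
candidates = list(ParameterGrid(param_grid))

# With cv=5, each candidate is fitted once per fold.
n_fits = len(candidates) * 5

print(len(candidates), n_fits)  # 8 40
```

With the default `refit=True`, `GridSearchCV` then fits the best candidate once more on the whole training set, and that refitted model is the one used by `grid_search.predict`.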

In [17]:
y_pred = grid_search.predict(X_test)
a = classification_report(y_test, y_pred, output_dict=True)
a['0']['f1-score']

0.9814941740918437

In [18]:
a

{'0': {'precision': 0.984869325997249,
  'recall': 0.9781420765027322,
  'f1-score': 0.9814941740918437,
  'support': 732.0},
 '1': {'precision': 0.9624413145539906,
  'recall': 0.9738717339667459,
  'f1-score': 0.9681227863046045,
  'support': 421.0},
 'accuracy': 0.9765828274067649,
 'macro avg': {'precision': 0.9736553202756197,
  'recall': 0.9760069052347391,
  'f1-score': 0.9748084801982241,
  'support': 1153.0},
 'weighted avg': {'precision': 0.9766800867798927,
  'recall': 0.9765828274067649,
  'f1-score': 0.9766118200082116,
  'support': 1153.0}}
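
The `macro avg` F1 is the unweighted mean of the per-class F1 scores, while `weighted avg` weights each class by its support. The arithmetic can be checked directly with the values copied from the report above:

```python
# Per-class values copied from the classification report above.
f1 = {"0": 0.9814941740918437, "1": 0.9681227863046045}
support = {"0": 732, "1": 421}

# Macro average: plain mean over classes.
macro_f1 = sum(f1.values()) / len(f1)

# Weighted average: mean over classes weighted by support.
weighted_f1 = sum(f1[c] * support[c] for c in f1) / sum(support.values())

print(round(macro_f1, 2))     # 0.97
print(round(weighted_f1, 2))  # 0.98
```

In the report dictionary the same macro value is available as `a['macro avg']['f1-score']`.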

In [212]:
# enc = OneHotEncoder(handle_unknown='ignore')
# nominal_cols = ['cap-shape','cap-surface','bruises','odor','stalk-shape','ring-number','ring-type','spore-print-color','population','habitat']
# enc_df = pd.DataFrame(enc.fit_transform(hw[nominal_cols]).toarray())

# unique_vals = enc.categories_
# new_col_names = []
# for i, vals in enumerate(unique_vals):
#     for val in vals:
#         new_col_names.append(f"{nominal_cols[i]}_{val}")

# enc_df.columns = new_col_names
# hw = pd.concat([hw, enc_df], axis=1)
# hw.drop(columns=nominal_cols, axis=1, inplace=True)
# hw.shape

In [213]:
hw

Unnamed: 0,cap-color-rate,cap-shape_c,cap-shape_f,cap-shape_k,cap-shape_x,cap-surface_g,cap-surface_s,cap-surface_y,bruises_t,odor_c,...,population_n,population_s,population_v,population_y,habitat_g,habitat_l,habitat_m,habitat_p,habitat_u,habitat_w
0,1.0,False,False,False,True,False,True,False,True,False,...,False,True,False,False,False,False,False,False,True,False
1,2.0,False,False,False,True,False,True,False,True,False,...,True,False,False,False,True,False,False,False,False,False
2,3.0,False,False,False,False,False,True,False,True,False,...,True,False,False,False,False,False,True,False,False,False
3,3.0,False,False,False,True,False,False,True,True,False,...,False,True,False,False,False,False,False,False,True,False
4,4.0,False,False,False,True,False,True,False,False,False,...,False,False,False,False,True,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5759,1.0,False,False,True,False,False,True,False,False,False,...,False,False,False,False,False,True,False,False,False,False
5760,1.0,False,False,False,True,False,True,False,False,False,...,False,False,True,False,False,True,False,False,False,False
5761,1.0,False,True,False,False,False,True,False,False,False,...,False,False,False,False,False,True,False,False,False,False
5762,1.0,False,False,True,False,False,False,True,False,False,...,False,False,True,False,False,True,False,False,False,False


In [None]:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report
from sklearn.compose import ColumnTransformer

class MushroomClassifier:
    def __init__(self, data_path): # DO NOT modify this line
        self.data_path = data_path
        self.df = pd.read_csv(data_path)

    def Q1(self): # DO NOT modify this line
        """
            1. (From step 1) Before doing the data prep., how many "na" are there in "gill-size" variables?
        """
        # remove pass and replace with your code
        return self.df['gill-size'].isna().sum()

    def q2(self):
        # Reload the raw data so repeated calls start from a clean copy
        # instead of trying to drop columns that are already gone
        self.df = pd.read_csv(self.data_path)
        self.df.dropna(subset=['label'], inplace=True)
        self.df.drop(columns=['id', 'gill-attachment', 'gill-spacing', 'gill-size','gill-color-rate', 'stalk-root', 'stalk-surface-above-ring', 'stalk-surface-below-ring', 'stalk-color-above-ring-rate','stalk-color-below-ring-rate','veil-color-rate','veil-type'], inplace=True)
        self.df.reset_index(inplace=True)

    def Q2(self): # DO NOT modify this line
        """
            2. (From step 2-4) How many rows of data, how many variables?
            - Drop rows where the target (label) variable is missing.
            - Drop the following variables:
            'id','gill-attachment', 'gill-spacing', 'gill-size','gill-color-rate','stalk-root', 'stalk-surface-above-ring',
            'stalk-surface-below-ring', 'stalk-color-above-ring-rate','stalk-color-below-ring-rate','veil-color-rate','veil-type'
            - Examine the number of rows, the number of columns, and whether any values are missing.
        """
        # remove pass and replace with your code
        self.q2()

        # Exclude the label from the count without mutating self.df,
        # so that later steps can still use the label column
        return self.df.drop(columns=['label']).shape

    def q3(self):
        num_features = ['cap-color-rate']
        cat_features = ['cap-shape', 'cap-surface', 'bruises', 'odor', 'stalk-shape',
                    'ring-number', 'ring-type', 'spore-print-color', 'population', 'habitat']
        
        num_pipeline = Pipeline([
            ('imputer', SimpleImputer(strategy='mean')),
        ])
        cat_pipeline = Pipeline([
            ('imputer', SimpleImputer(strategy='most_frequent'))
        ])

        preprocessor = ColumnTransformer([
            ('num', num_pipeline, num_features),
            ('cat', cat_pipeline, cat_features)
        ])
        processed = preprocessor.fit_transform(self.df)
        # keep other columns e.g. label
        other_cols = self.df.drop(columns=(num_features + cat_features))

        processed_df = pd.DataFrame(processed, columns=num_features + cat_features, index=self.df.index)
        # The mixed numeric/string output is an object array, so restore the numeric dtype
        processed_df[num_features] = processed_df[num_features].astype(float)
        self.df = pd.concat([processed_df, other_cols], axis=1)
        self.df["label"] = self.df["label"].map({'e':1, 'p':0})

        # num_imp = SimpleImputer(missing_values=np.nan, strategy='mean')
        # self.df[['cap-color-rate']] = pd.DataFrame(num_imp.fit_transform(self.df[['cap-color-rate']]))
        # cat_imp = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
        # columns = ['cap-shape', 'cap-surface', 'bruises', 'odor', 'stalk-shape', 'ring-number', 'ring-type', 'spore-print-color', 'population', 'habitat']
        # self.df[columns] = pd.DataFrame(cat_imp.fit_transform(self.df[columns]))


    def Q3(self): # DO NOT modify this line
        """
            3. (From step 5-6) Answer the quantity class0:class1
            - Fill missing values by adding the mean for numeric variables and the mode for nominal variables.
            - Convert the label variable e (edible) to 1 and p (poisonous) to 0 and check the quantity. class0: class1
            - Note: You need to reproduce the process (code) from Q2 to obtain the correct result.
        """
        # remove pass and replace with your code
        self.q2()  # reproduce the Q2 preparation first, as the note above requires
        self.q3()

        return ((self.df["label"] == 0).sum(), (self.df["label"] == 1).sum())

    def q4(self) -> tuple[pd.DataFrame, pd.DataFrame, pd.Series, pd.Series]:
        nominal_cols = ['cap-shape','cap-surface','bruises','odor','stalk-shape','ring-number','ring-type','spore-print-color','population','habitat']
        # dummy: cap-shape => cap-shape_c cap-shape_f cap-shape_k cap-shape_x	
        dummy = pd.get_dummies(self.df[nominal_cols], drop_first=True)
        self.df = pd.concat([self.df, dummy], axis=1)
        self.df.drop(columns=nominal_cols, inplace=True)
        self.df.drop(columns=["index"], inplace=True)

        y = self.df.pop('label')
        X = self.df
        X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=2020)

        return X_train, X_test, y_train, y_test

    def Q4(self): # DO NOT modify this line
        """
            4. (From step 7-8) What are the sizes of the training and testing sets?
            - Convert the nominal variable to numeric using a dummy code with drop_first = True.
            - Split train/test with 20% test, stratify, and seed = 2020.
            - Note: You need to reproduce the process (code) from Q2, Q3 to obtain the correct result.
        """
        # remove pass and replace with your code
        self.q2()
        self.q3()
        X_train, X_test, y_train, y_test = self.q4()
        
        return (X_train.shape, X_test.shape)

    def q5(self, X_train, y_train):
        rf_clf = RandomForestClassifier(random_state=2020)
        param_grid = {
            'criterion': ['gini','entropy'],
            'max_depth': [2,3],
            'min_samples_leaf': [2,5],
            'n_estimators': [100],
        }

        grid_search = GridSearchCV(estimator=rf_clf, param_grid=param_grid, cv=5)
        grid_search.fit(X_train, y_train) 

        return grid_search

    def Q5(self):
        """
            5. (From step 9) Best params after doing random forest grid search.
            Create a Random Forest with GridSearch on training data with 5 CV.
            - 'criterion':['gini','entropy']
            - 'max_depth': [2,3]
            - 'min_samples_leaf':[2,5]
            - 'n_estimators':[100]
            - 'random_state': 2020
            - Note: You need to reproduce the process (code) from Q2, Q3, Q4 to obtain the correct result.
        """
        # remove pass and replace with your code
        self.q2()
        self.q3()
        X_train, X_test, y_train, y_test = self.q4()
        grid_search = self.q5(X_train, y_train)
        
        params = grid_search.best_params_
        
        return (params['criterion'], params['max_depth'], params['min_samples_leaf'], params['n_estimators'], 2020)


    def Q6(self):
        """
            6. (From step 10) What is the value of macro f1 (2 digits)?
            Predict the testing data set with confusion_matrix and classification_report,
            using scientific rounding (less than 0.5 dropped, more than 0.5 then increased)
            - Note: You need to reproduce the process (code) from Q2, Q3, Q4, Q5 to obtain the correct result.
        """
        # remove pass and replace with your code
        self.q2()
        self.q3()
        X_train, X_test, y_train, y_test = self.q4()
        grid_search = self.q5(X_train, y_train)

        y_pred = grid_search.predict(X_test)
        report = classification_report(y_test, y_pred, output_dict=True)

        return (round(report['0']['f1-score'], 2), round(report['1']['f1-score'], 2))


Run the code below to check that your code works.

In [2]:
hw = MushroomClassifier('mushroom2020_dataset.csv')

# print(hw.Q1())
# print(hw.Q2())
# print(hw.Q3())
# print(hw.Q4())
# print(hw.Q5())
print(hw.Q6())

(0.98, 0.97)
