## **Assignment 3 (2023/2): ML1**
**Safe to eat or deadly poison?**



This homework is a classification task to identify whether a mushroom is edible or poisonous.

This dataset includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota Family, drawn from The Audubon Society Field Guide to North American Mushrooms (1981).

Each species is identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended. This latter class was combined with the poisonous one. The Guide clearly states that there is no simple rule for determining the edibility of a mushroom; no rule like "leaflets three, let it be" for Poisonous Oak and Ivy.


Step 1. Load 'mushroom2020_dataset.csv' from the “Attachment” (note: this dataset has already been partially prepared).

Step 2. Drop rows where the target (label) variable is missing.

Step 3. Drop the following variables:
'id','gill-attachment', 'gill-spacing', 'gill-size','gill-color-rate', 'stalk-root', 'stalk-surface-above-ring', 'stalk-surface-below-ring', 'stalk-color-above-ring-rate','stalk-color-below-ring-rate','veil-color-rate','veil-type'

Step 4. Examine the number of rows, the number of columns, and whether any values are missing.

Step 5. Fill missing values with the mean for numeric variables and the mode for nominal variables.

Step 6. Convert the label variable e (edible) to 1 and p (poisonous) to 0, then check the class counts (class0:class1).

Step 7. Convert the nominal variables to numeric using dummy coding with drop_first = True.

Step 8. Split into train and test sets with 20% test, stratified on the label, with seed = 2020.

Step 9. Create a Random Forest with GridSearch on the training data, using 5-fold CV and n_jobs=-1, over the following parameters:
'criterion': ['gini','entropy']
'max_depth': [2,3]
'min_samples_leaf': [2,5]
'n_estimators': [100]
'random_state': 2020

Step 10. Predict on the test set and evaluate the predictions with classification_report.
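
The preprocessing in Steps 5 to 7 can be sketched on a toy frame. This is only an illustration under stated assumptions: the column names and values below are hypothetical stand-ins for the real file, and it uses plain pandas `fillna` rather than scikit-learn's `SimpleImputer`, which is what the solution class uses.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the mushroom file (hypothetical columns and values).
toy = pd.DataFrame({
    "label": ["e", "p", "e", "p"],
    "cap-color-rate": [1.0, np.nan, 3.0, 2.0],
    "odor": ["a", "n", None, "n"],
})

# Step 5: mean for numeric columns, mode (most frequent value) for nominal ones.
num_cols = toy.select_dtypes(include="number").columns
cat_cols = toy.columns.difference(num_cols)
toy[num_cols] = toy[num_cols].fillna(toy[num_cols].mean())
toy[cat_cols] = toy[cat_cols].fillna(toy[cat_cols].mode().iloc[0])

# Step 6: encode the label as e -> 1, p -> 0.
toy["label"] = toy["label"].map({"e": 1, "p": 0})

# Step 7: dummy-code the remaining nominal columns, dropping the first level.
encoded = pd.get_dummies(toy, drop_first=True)
print(encoded)
```

With `drop_first=True`, a nominal column with k levels becomes k-1 indicator columns, so the dropped level is represented by all of its indicators being zero.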


**Complete the class MushroomClassifier from the code template below.**

In [247]:
#import your other libraries here
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

In [248]:
class MushroomClassifier:
    def __init__(self, data_path): # DO NOT modify this line
        self.data_path = data_path
        self.df = pd.read_csv(data_path)

    def Q1(self): # DO NOT modify this line
        """
            1. (From step 1) Before doing the data prep., how many missing values ("na") are there in the "gill-size" variable?
        """
        count_na = self.df['gill-size'].isna().sum()
        # remove pass and replace with you code
        return count_na

    def Q2(self): # DO NOT modify this line
        """
            2. (From step 2-4) How many rows of data, how many variables?
            - Drop rows where the target (label) variable is missing.
            - Drop the following variables:
            'id','gill-attachment', 'gill-spacing', 'gill-size','gill-color-rate','stalk-root', 'stalk-surface-above-ring',
            'stalk-surface-below-ring', 'stalk-color-above-ring-rate','stalk-color-below-ring-rate','veil-color-rate','veil-type'
            - Examine the number of rows, the number of columns, and whether any values are missing.
        """
        # remove pass and replace with you code
        # Pass subset as a list so this also works on pandas versions that reject a bare string
        self.df.dropna(axis=0, subset=['label'], inplace=True)
        self.df.drop(['id','gill-attachment', 'gill-spacing', 'gill-size','gill-color-rate','stalk-root', 'stalk-surface-above-ring',
            'stalk-surface-below-ring', 'stalk-color-above-ring-rate','stalk-color-below-ring-rate','veil-color-rate','veil-type'],axis=1,inplace=True)
        return self.df.shape

    def Q3(self): # DO NOT modify this line
        """
            3. (From step 5-6) Report the class counts, class0:class1
            - Fill missing values with the mean for numeric variables and the mode for nominal variables.
            - Convert the label variable e (edible) to 1 and p (poisonous) to 0, then check the class counts (class0:class1).
        """
        # remove pass and replace with you code
        self.Q2()
        # np.nan is used here because the np.NaN alias was removed in NumPy 2.0
        mean_imp = SimpleImputer(missing_values=np.nan, strategy='mean')
        mode_imp = SimpleImputer(missing_values=np.nan, strategy='most_frequent')

        numeric_df = self.df.select_dtypes(include=[np.number])
        mode_df = self.df.select_dtypes(include=[object])

        self.df[numeric_df.columns] = mean_imp.fit_transform(numeric_df)
        self.df[mode_df.columns] = mode_imp.fit_transform(mode_df)

        self.df['label'] = self.df['label'].map({'p': 0, 'e': 1})
        return self.df["label"].value_counts()


    def Q4(self): # DO NOT modify this line
        """
            4. (From step 7-8) What are the sizes of the training and test sets?
            - Convert the nominal variables to numeric using dummy coding with drop_first = True.
            - Split into train and test sets with 20% test, stratified on the label, with seed = 2020.
        """
        # remove pass and replace with you code
        self.Q3()
        self.df = pd.get_dummies(self.df, drop_first=True)

        X = self.df.drop("label", axis=1, inplace=False)
        y= self.df["label"]

        self.X_train,self.X_test,self.y_train,self.y_test = train_test_split(X, y ,stratify=y, test_size=0.2, random_state=2020)
        return self.X_train.shape,self.X_test.shape



    def Q5(self):
        """
            5. (From step 9) Best params after doing random forest grid search.
            Create a Random Forest with GridSearch on training data with 5 CV with n_jobs=-1.
            - 'criterion':['gini','entropy']
            - 'max_depth': [2,3]
            - 'min_samples_leaf':[2,5]
            - 'n_estimators':[100]
            - 'random_state': 2020
        """
        # remove pass and replace with you code
        self.Q4()
        my_param_grid = {
            'criterion':['gini','entropy'],
            'max_depth': [2,3],
            'min_samples_leaf':[2,5],
            'n_estimators':[100],
            'random_state': [2020],
        }
        grid_search = GridSearchCV(estimator=RandomForestClassifier(),param_grid=my_param_grid,cv=5, n_jobs=-1)
        self.best_model = grid_search.fit(self.X_train, self.y_train)
        return grid_search.best_params_

    def Q6(self):
        """
            6. (From step 10) What is the value of macro F1? (Mind the number of digits!)
            Predict on the test set and evaluate with confusion_matrix and classification_report,
            using scientific rounding (below 0.5 rounds down, above 0.5 rounds up)
        """
        # remove pass and replace with you code
        self.Q5()
        predictions = self.best_model.predict(self.X_test)
        return classification_report(self.y_test,predictions)


Run the code below only to check that your code works; there is no need to submit it to the grader.

In [249]:
def main():
    hw = MushroomClassifier('mushroom2020_dataset.csv')
    exec(input().strip()) # do not delete this line

if __name__ == "__main__":
    main()

              precision    recall  f1-score   support

           0       0.98      0.98      0.98       732
           1       0.96      0.97      0.97       421

    accuracy                           0.98      1153
   macro avg       0.97      0.98      0.97      1153
weighted avg       0.98      0.98      0.98      1153
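
Q6 asks for the macro F1 value, while `classification_report` returns a formatted string like the one above. A minimal sketch of two ways to get the number itself, using hypothetical labels rather than the mushroom predictions: read it from the report with `output_dict=True`, or compute it with `f1_score(average="macro")`.

```python
from sklearn.metrics import classification_report, f1_score

# Hypothetical true and predicted labels, for illustration only.
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

# Option 1: ask classification_report for a nested dict instead of a string.
report = classification_report(y_true, y_pred, output_dict=True)
macro_f1_from_report = report["macro avg"]["f1-score"]

# Option 2: compute the macro-averaged F1 directly.
macro_f1_direct = f1_score(y_true, y_pred, average="macro")

# Both routes give the unweighted mean of the per-class F1 scores.
print(round(macro_f1_direct, 2))
```

In the class above, the same idea would apply to `self.y_test` and the predictions made in `Q6`; whether the grader expects the string or the number is not stated in the assignment.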

