# Mushroom Classifier


## Name: Brian Beadell

### Overview
In this analysis, I predict whether a mushroom is edible or poisonous from its categorical features. I first explored the dataset's structure, then preprocessed it by handling missing values and encoding categorical variables. To impute missing values, particularly in the "stalk-root" feature, I used a K-Nearest Neighbors (KNN) classifier. I then trained RandomForestClassifier and LogisticRegression models on the preprocessed dataset, using one-hot encoding for the features and label encoding for the response variable. I evaluated the models on accuracy, precision, recall, and training time, both on the full feature set and on a PCA-reduced version of it. The main challenges were missing data and computational constraints, which I addressed through preprocessing and by comparing models. As next steps, I suggest refining the models, exploring other techniques, and collaborating with domain experts on real-world applications.

In this section, I import the libraries required for the analysis. I start with pandas and numpy for data manipulation and numerical operations, respectively. From scikit-learn, I import train_test_split for splitting the data, KNeighborsClassifier for the K-Nearest Neighbors algorithm, OneHotEncoder and LabelEncoder for encoding categorical features, RandomForestClassifier and LogisticRegression for training classification models, accuracy_score, precision_score, and recall_score for evaluating model performance, and PCA for Principal Component Analysis. I also import Counter from Python's collections module for counting occurrences.

In [1]:
# Import libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from collections import Counter
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score, recall_score
from sklearn.decomposition import PCA



In this code, I define a list named `columns` containing the names of the features in the mushroom dataset, such as cap shape, cap color, and odor. The data file is loaded with `header=None`, so listing these names explicitly ensures that each attribute is correctly labeled when I read it with pandas' `read_csv` function. I also set the `na_values` parameter so that the '?' placeholder in the file is read as a missing value. This step matters for preprocessing because it lets me detect and handle missing data later in the analysis.

In [2]:
# Column names
columns = ['class', 'cap-shape', 'cap-surface', 'cap-color', 'bruises?', 'odor',
           'gill-attachment', 'gill-spacing', 'gill-size', 'gill-color', 'stalk-shape',
           'stalk-root', 'stalk-surface-above-ring', 'stalk-surface-below-ring',
           'stalk-color-above-ring', 'stalk-color-below-ring', 'veil-type', 'veil-color',
           'ring-number', 'ring-type', 'spore-print-color', 'population', 'habitat']

In [3]:
# Load the data
data = pd.read_csv('agaricus-lepiota.data', header=None, names=columns, na_values='?')
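
As a small illustration of the `na_values` behavior described above, the sketch below reads a made-up three-row sample instead of the real file. The sample rows and the shortened column list are assumptions for demonstration only.

```python
import io
import pandas as pd

# Hypothetical three-row sample in the same comma-separated layout;
# only three illustrative columns are used here.
sample = io.StringIO("p,x,e\ne,b,?\ne,x,c\n")
cols = ["class", "cap-shape", "stalk-root"]

df = pd.read_csv(sample, header=None, names=cols, na_values="?")

# The '?' entry is parsed as NaN, so it is counted as missing.
print(df["stalk-root"].isna().sum())
```

Without `na_values='?'`, the placeholder would be kept as an ordinary string and `isna()` would not flag it.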

In this section, I'm addressing missing data in the dataset related to mushrooms. First, I'm identifying instances where the 'stalk-root' feature contains missing values and storing them in a DataFrame named missing_data. Then, I'm creating a DataFrame named present_data by removing rows with missing values in the 'stalk-root' feature.

After handling missing data, I'm defining the features and labels for the analysis. The features `X_present` consist of all columns except 'stalk-root', while the labels `y_present` correspond to the 'stalk-root' feature from the cleaned dataset. This separation enables me to prepare the data for further analysis and modeling.

In [4]:
# Handle missing data
missing_data = data[data['stalk-root'].isna()]
present_data = data.dropna(subset=['stalk-root'])

In [5]:
# Assign features and labels
X_present = present_data.drop(columns=['stalk-root'])
y_present = present_data['stalk-root']

This section handles encoding for categorical features. OneHotEncoder converts categorical features in `X_present` into binary vectors, while LabelEncoder encodes labels in `y_present`. After encoding, the data is split into training and test sets for model validation, with 80% for training and 20% for testing, ensuring the model's generalization ability is assessed on unseen data.

In [6]:
# Encoding
enc = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
X_present_enc = enc.fit_transform(X_present)
le = LabelEncoder()
y_present_enc = le.fit_transform(y_present)

# Splitting data
X_train, X_test, y_train, y_test = train_test_split(X_present_enc, y_present_enc, test_size=.2, random_state=42)
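
To make the two encodings concrete, here is a minimal pure-Python sketch of what one-hot and label encoding produce. It is a hand-rolled illustration, not scikit-learn's implementation; the helper names `one_hot` and `label_encode` and the toy values are hypothetical.

```python
# Toy rows with two categorical columns (hypothetical values).
rows = [["x", "n"], ["b", "y"], ["x", "w"]]

def one_hot(rows):
    """One-hot encode each column, with categories in sorted order."""
    n_cols = len(rows[0])
    categories = [sorted({row[j] for row in rows}) for j in range(n_cols)]
    encoded = []
    for row in rows:
        vec = []
        for j, value in enumerate(row):
            # One binary slot per category of column j
            vec.extend(1 if value == cat else 0 for cat in categories[j])
        encoded.append(vec)
    return encoded, categories

def label_encode(labels):
    """Map each label to the index of its class in sorted order."""
    classes = sorted(set(labels))
    index = {c: i for i, c in enumerate(classes)}
    return [index[label] for label in labels], classes

X_enc, cats = one_hot(rows)
y_enc, classes = label_encode(["e", "b", "e"])
print(X_enc)
print(y_enc)
```

Each feature column expands into one binary column per category, while the labels stay in a single column of integers.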

This section defines functions for KNN classification:

- `get_neighbors`: Calculates distances between the test row and all training set rows to identify the nearest neighbors based on the specified number `num_neighbors`.

- `predict_classification`: Predicts the class label for a test row by determining the most common class label among its nearest neighbors.

Using these functions, predictions are generated for the test set with num_neighbors set to 3.

In [7]:
# Define KNN functions
def get_neighbors(X_train, test_row, num_neighbors):
    # Euclidean distance from the test row to every training row
    distances = np.linalg.norm(X_train - test_row, axis=1)
    # Indices of the training rows, closest first
    sorted_row_indices = np.argsort(distances)
    return sorted_row_indices[:num_neighbors]

def predict_classification(X_train, y_train, test_row, num_neighbors):
    neighbor_indices = get_neighbors(X_train, test_row, num_neighbors)
    output_values = y_train[neighbor_indices]
    # Majority vote among the labels of the nearest neighbors
    output_values_counter = Counter(output_values)
    prediction = output_values_counter.most_common(1)[0][0]
    return prediction

num_neighbors = 3
predictions = [predict_classification(X_train, y_train, row, num_neighbors) for row in X_test]
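
The following toy example walks through the same distance-then-vote logic on four one-hot encoded rows. The data and the compact helper `knn_predict` are hypothetical and only meant to show how the vote is formed; on one-hot features, the squared Euclidean distance between two rows is twice the number of features on which they differ.

```python
from collections import Counter
import numpy as np

def knn_predict(X_train, y_train, test_row, k):
    # Euclidean distance from the test row to every training row
    distances = np.linalg.norm(X_train - test_row, axis=1)
    nearest = np.argsort(distances)[:k]
    # Majority vote among the k nearest labels
    return Counter(y_train[nearest].tolist()).most_common(1)[0][0]

# Two categorical features, each one-hot encoded into two columns.
X_toy = np.array([[1, 0, 1, 0],
                  [1, 0, 0, 1],
                  [0, 1, 0, 1],
                  [0, 1, 1, 0]], dtype=float)
y_toy = np.array([0, 0, 1, 1])

test_row = np.array([1.0, 0.0, 1.0, 0.0])
pred = knn_predict(X_toy, y_toy, test_row, k=3)

# Rows 0 and 1 differ in exactly one feature, so the squared distance is 2.
sq_dist = float(np.sum((X_toy[0] - X_toy[1]) ** 2))
print(pred, sq_dist)
```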

In this section, missing values in the 'stalk-root' column are predicted and imputed. The process involves preparing features from the missing data and transforming them using a pre-fitted encoder. Predictions for the encoded features are made using KNN, and these predicted labels are decoded into their original categories. Unique values along with their counts are then displayed, providing insights into the distribution of imputed values. The imputed values are assigned back into the original dataset, completing the imputation process. Finally, the remaining count of missing values is printed, confirming the success of the imputation.

In [8]:
# Predicting missing values for 'stalk-root' column
X_missing = missing_data.drop(columns=['stalk-root'])
X_missing_enc = enc.transform(X_missing)

# Predicting encoded labels
missing_labels_enc = [predict_classification(X_train, y_train, row, num_neighbors) for row in X_missing_enc]

# Decoding labels into original categories
missing_values_imputed = le.inverse_transform(missing_labels_enc)
 
missing_values = list(missing_values_imputed) # Generate list missing_values
unique_values, counts = np.unique(missing_values, return_counts=True)

for value, count in zip(unique_values, counts): # Display unique values and counts
    print(f"{value}: {count}")

missing_data_indices = missing_data.index # Impute missing values back into the original dataset
for i, index in enumerate(missing_data_indices):
    data.at[index, 'stalk-root'] = missing_values[i]

# Print the number of missing values remaining in 'stalk-root'
print(f"Remaining missing values in 'stalk-root': {data['stalk-root'].isna().sum()}")

b: 1904
c: 60
e: 516
Remaining missing values in 'stalk-root': 0
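
The row-by-row assignment above can also be written as a single label-based assignment with `.loc`. The sketch below uses a made-up four-row frame and made-up imputed values, so it only illustrates the pattern, not the notebook's actual data.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with two missing 'stalk-root' entries.
df = pd.DataFrame({"stalk-root": ["b", np.nan, "e", np.nan]})
missing_idx = df.index[df["stalk-root"].isna()]

# Hypothetical imputed values, in the same order as missing_idx.
imputed = ["c", "b"]

# One label-based assignment replaces the row-by-row loop.
df.loc[missing_idx, "stalk-root"] = imputed

remaining = int(df["stalk-root"].isna().sum())
print(remaining)
```

This relies on the imputed values being in the same order as the index of the missing rows, which is also what the loop assumes.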


**Graded Concept Question #1:** Why don’t we one-hot encode the response data to train the KNN model instead? 

One-hot encoding is not used for the response because this KNN implementation needs a single label per training row. The distances are computed on the features only; the response is used solely in the majority vote among the nearest neighbors, where `Counter` tallies one hashable value per neighbor. One-hot encoding would spread each label across several binary columns, which cannot be tallied directly and adds no information. Label encoding keeps the response in a single column, which is all the vote requires.
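
A short sketch of this point, using hypothetical neighbor labels: a one-dimensional array of label codes can be tallied with `Counter`, whereas the rows of a one-hot matrix are arrays and cannot be hashed.

```python
from collections import Counter
import numpy as np

# Label-encoded neighbors: one hashable integer per neighbor.
labels_1d = np.array([2, 0, 2])
vote = Counter(labels_1d.tolist()).most_common(1)[0][0]

# One-hot neighbors: each item is an array, which Counter cannot hash.
labels_onehot = np.array([[0, 0, 1], [1, 0, 0], [0, 0, 1]])
try:
    Counter(labels_onehot)
    hashable = True
except TypeError:
    hashable = False

print(vote, hashable)
```

The vote could still be recovered from one-hot rows (for example by summing columns), but that only reconstructs what the label-encoded column already provides.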

In this section, the dataset undergoes encoding for modeling:

- Features (`X`) and the response variable (`y`) are isolated.
- Features are one-hot encoded using `OneHotEncoder`, ensuring categorical variables are represented as binary vectors.
- The response variable is label encoded with `LabelEncoder` to convert categorical labels into numerical values.
- The dataset is divided into training and testing sets using `train_test_split`, with 20% allocated for testing and a specified random state for consistency.

In [9]:
#Encoding dataset
X = data.drop(columns=['class']) #Features
y = data['class'] #Response variable

# One-hot encode features
enc = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
X_enc = enc.fit_transform(X)

#Label encode response variable
le = LabelEncoder()
y_enc = le.fit_transform(y)

# Splitting data
X_train, X_test, y_train, y_test = train_test_split(X_enc, y_enc, test_size=.2, random_state=42)

In this section, two classifiers, RandomForestClassifier and LogisticRegression, are instantiated and trained:

- RandomForestClassifier is instantiated with a specified random state.
- Training time for RandomForestClassifier is measured using `%time`.
- LogisticRegression is instantiated with a specified random state.
- Training time for LogisticRegression is measured using `%time`.

In [10]:
# Instantiate RandomForestClassifier
rf = RandomForestClassifier(random_state=42)
%time rf.fit(X_train, y_train)                            

CPU times: user 495 ms, sys: 13.2 ms, total: 508 ms
Wall time: 515 ms


In [11]:
# Instantiate LogisticRegression
lr = LogisticRegression(random_state=42)
%time lr.fit(X_train, y_train)

CPU times: user 313 ms, sys: 22.9 ms, total: 336 ms
Wall time: 199 ms


**Graded Concept Question #2:** Could we train these two models by one-hot encoding the response data instead, being careful to specify that the drop parameter of the OneHotEncoder class is set to ‘first’? Why or why not?

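One-hot encoding the response is unusual in a classification task such as predicting mushroom edibility. Because the response here is binary, a one-hot encoder with `drop='first'` would leave a single 0/1 column that carries the same information as the label-encoded response, so the models could in principle be trained on it after flattening it to one dimension. There is no benefit in doing so, though: LogisticRegression and RandomForestClassifier accept a single column of class labels directly. One-hot encoding is useful for input features without an ordinal relationship, whereas for the response, label encoding is simpler and more efficient.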
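
The equivalence for a binary response can be checked with a small NumPy sketch. It uses `np.unique` and `np.eye` as stand-ins for scikit-learn's encoders, and the label values are a made-up sample.

```python
import numpy as np

y = np.array(["p", "e", "e", "p", "e"])

# Label encoding: index of each value among the sorted classes.
classes, codes = np.unique(y, return_inverse=True)

# Full one-hot matrix, then drop the first column (the "drop first" idea).
one_hot = np.eye(len(classes), dtype=int)[codes]
dropped = one_hot[:, 1:]

# After flattening, the remaining column equals the label codes.
same_values = np.array_equal(dropped.ravel(), codes)
print(dropped.shape, same_values)
```

The only difference left is the shape: the one-hot result is a two-dimensional column, while the label codes are one-dimensional.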

In this section, predictions are made for both RandomForestClassifier and LogisticRegression models on the test set:

- Predictions for RandomForestClassifier `rf_pred` and LogisticRegression `lr_pred` are generated using the `predict` method.
- Accuracy scores `rf_acc` and `lr_acc` are calculated using `accuracy_score` by comparing predicted labels with true labels from the test set.
- Precision scores `rf_precision` and `lr_precision` are calculated using `precision_score`, representing the proportion of correctly predicted positive instances among all instances predicted as positive.
- Recall scores `rf_recall` and `lr_recall` are calculated using `recall_score`, representing the proportion of correctly predicted positive instances among all actual positive instances.

These metrics provide insights into the performance of both models in terms of accuracy, precision, and recall.
The returned values indicate that both models attained 100% accuracy, precision, and recall on the test dataset.

In [12]:
rf_pred = rf.predict(X_test)
rf_acc = accuracy_score(y_test, rf_pred)

lr_pred = lr.predict(X_test)
lr_acc = accuracy_score(y_test, lr_pred)

rf_precision = precision_score(y_test, rf_pred) # RandomForest precision score
rf_recall = recall_score(y_test, rf_pred) # RandomForest recall score

lr_precision = precision_score(y_test, lr_pred) # LogisticRegression precision score
lr_recall = recall_score(y_test, lr_pred) # LogisticRegression recall score

(rf_acc, lr_acc, rf_precision, lr_precision, rf_recall, lr_recall)

(1.0, 1.0, 1.0, 1.0, 1.0, 1.0)
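
For reference, the three metrics can be computed by hand from counts of true positives, false positives, and false negatives. The sketch below uses a hypothetical helper `binary_metrics` and made-up labels; it treats label 1 as the positive class, which matches the default `pos_label=1` of `precision_score` and `recall_score`. Since `LabelEncoder` sorts the classes, 'e' maps to 0 and 'p' to 1, so "positive" in this notebook means poisonous.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, and recall for binary labels.

    Assumes at least one predicted and one actual positive,
    so the denominators are nonzero.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    correct = sum(t == p for t, p in pairs)
    return {
        "accuracy": correct / len(pairs),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }

# Made-up labels with one false positive and one false negative.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
scores = binary_metrics(y_true, y_pred)
print(scores)
```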

In this section, PCA (Principal Component Analysis) is applied to reduce the dimensionality of the training data:

- PCA is instantiated with a specified target explained variance ratio of 95% and a random state for reproducibility.
- PCA is fitted to the training data `X_train` to learn the transformation.
- The original number of features `org_shape`, the reduced number of features after PCA transformation `reduced_shape`, and the percentage reduction in dimensionality `precent_reduced` are calculated and displayed. This provides insights into the effectiveness of dimensionality reduction achieved by PCA.

After the reduction, the number of dimensions of the training set dropped by approximately 65%, from 116 one-hot encoded features to 40 principal components.

In [13]:
pca = PCA(n_components=.95, random_state=42) # Instantiate PCA

X_train_pca = pca.fit_transform(X_train) # Fit PCA

# Calculation of percent reduction
org_shape = X_train.shape[1]
reduced_shape = X_train_pca.shape[1]
precent_reduced = (1 - reduced_shape / org_shape) * 100

(org_shape, reduced_shape, precent_reduced)

(116, 40, 65.51724137931035)
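
The idea behind a fractional `n_components` can be sketched with plain NumPy: compute the explained-variance ratio of each component and keep the smallest number of components whose cumulative ratio exceeds the target. This is an illustrative reimplementation on randomly generated data, not scikit-learn's code, and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 200 samples, 6 features, only two strong directions.
latent = rng.normal(size=(200, 2))
X = latent @ rng.normal(size=(2, 6)) + 0.01 * rng.normal(size=(200, 6))

# Explained-variance ratios from the singular values of the centered data.
Xc = X - X.mean(axis=0)
singular_values = np.linalg.svd(Xc, compute_uv=False)
ratios = singular_values**2 / np.sum(singular_values**2)
cumulative = np.cumsum(ratios)

# Smallest number of components whose cumulative ratio exceeds 95%.
k = int(np.searchsorted(cumulative, 0.95, side="right") + 1)
percent_reduced = (1 - k / X.shape[1]) * 100

# The same arithmetic with the notebook's figures (116 features, 40 components).
notebook_reduction = (1 - 40 / 116) * 100
print(k, percent_reduced, notebook_reduction)
```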

In this section, the test set is prepared using PCA transformation:

- The test set `X_test` is transformed into the reduced feature space using the previously fitted PCA model `pca`.

Then, models are instantiated and trained on the reduced dataset:

- RandomForestClassifier `rf_pca` is instantiated with a specified random state.
- Training time for RandomForestClassifier on the reduced dataset is measured using `%time`.
- LogisticRegression `lr_pca` is instantiated with a specified random state.
- Training time for LogisticRegression on the reduced dataset is measured using `%time`.

In [14]:
X_test_pca = pca.transform(X_test) # Preparing test set

# Instantiate and train RandomForestClassifier on the reduced dataset
rf_pca = RandomForestClassifier(random_state=42)
%time rf_pca.fit(X_train_pca, y_train)

CPU times: user 3.42 s, sys: 50.7 ms, total: 3.47 s
Wall time: 3.41 s


In [15]:
# Instantiate and train LogisticRegression model on reduced data
lr_pca = LogisticRegression(random_state=42)
%time lr_pca.fit(X_train_pca, y_train)

CPU times: user 88.9 ms, sys: 8.69 ms, total: 97.5 ms
Wall time: 78.4 ms


In this section, predictions are made using the trained RandomForestClassifier `rf_pca` and LogisticRegression `lr_pca` models on the reduced test set:

- Predictions for both models are generated using the `predict` method on the reduced test set `X_test_pca`.
- Accuracy scores `rf_pca_acc` and `lr_pca_acc`, precision scores `rf_pca_precision` and `lr_pca_precision`, and recall scores `rf_pca_recall` and `lr_pca_recall` are calculated for both models using the corresponding metrics functions from scikit-learn.

These metrics provide insights into the performance of the models trained on the reduced dataset in terms of accuracy, precision, and recall.

In [16]:
rf_pca_pred = rf_pca.predict(X_test_pca)
lr_pca_pred = lr_pca.predict(X_test_pca)

rf_pca_acc = accuracy_score(y_test, rf_pca_pred)
rf_pca_precision = precision_score(y_test, rf_pca_pred)
rf_pca_recall = recall_score(y_test, rf_pca_pred)

lr_pca_acc = accuracy_score(y_test, lr_pca_pred)
lr_pca_precision = precision_score(y_test, lr_pca_pred)
lr_pca_recall = recall_score(y_test, lr_pca_pred)

In this section, I present a structured comparison of model performance and training times for both the full data and PCA reduced datasets:

- **Structured Data**: I organize the data into a structured format with columns representing different aspects of model evaluation, including model type, performance metrics (accuracy, precision, recall), and training time.
- **Model Performance**: For each model type (RandomForest and LogisticRegression), I report accuracy, precision, and recall scores for both the full data and PCA reduced datasets. These metrics provide insights into how well the models perform in terms of correctly predicting outcomes and capturing true positives.
- **Training Time**: I also include the training time for each model type on both datasets. This metric indicates the computational efficiency of each model in terms of the time required to train on the given dataset size.
- **DataFrame Presentation**: The structured data is presented in a DataFrame format for clarity and easy comparison. Each row represents a specific aspect of model evaluation, facilitating a side-by-side comparison of metrics across different models and datasets.

Overall, this structured comparison gives an overview of model performance and training time for both the full and PCA-reduced datasets, which supports model selection. The times are prefixed with `~` because I entered them by hand as approximate values; I was unable to populate the measured times in the table programmatically, so they may not exactly match the `%time` outputs shown above.

In [20]:
# Use a separate name so the mushroom DataFrame `data` is not overwritten
comparison = {
    "Model": ["RandomForest", "", "", "", "LogisticRegression", "", "", ""],
    "Item": ["Accuracy", "Precision", "Recall", "Time", "Accuracy", "Precision", "Recall", "Time"],
    "Full Data": [rf_acc, rf_precision, rf_recall, '~502 ms',
                  lr_acc, lr_precision, lr_recall, '~289 ms'],
    "PCA Reduced": [rf_pca_acc, rf_pca_precision, rf_pca_recall, '~3.32 s',
                    lr_pca_acc, lr_pca_precision, lr_pca_recall, '~97.5 ms']
}

# Convert the structured data into a DataFrame
structured_df = pd.DataFrame(comparison)

# Display the DataFrame without the index
print(structured_df.to_string(index=False))

             Model      Item Full Data PCA Reduced
      RandomForest  Accuracy       1.0         1.0
                   Precision       1.0         1.0
                      Recall       1.0         1.0
                        Time   ~502 ms     ~3.32 s
LogisticRegression  Accuracy       1.0    0.995692
                   Precision       1.0    0.994891
                      Recall       1.0    0.996164
                        Time   ~289 ms    ~97.5 ms
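
One possible way to fill the "Time" rows from measurements rather than by hand is to time each call with `time.perf_counter` and format the result. This is a sketch with hypothetical helpers `timed` and `format_seconds`; it measures wall-clock time, whereas `%time` also reports CPU time, and the workload below is a stand-in for a call such as `rf.fit(X_train, y_train)`.

```python
import time

def timed(func, *args, **kwargs):
    """Run func and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start

def format_seconds(seconds):
    """Render a duration in ms below one second, otherwise in s."""
    return f"{seconds * 1000:.1f} ms" if seconds < 1 else f"{seconds:.2f} s"

# Stand-in workload; in the notebook this would be a model's fit call.
_, elapsed = timed(sum, range(100_000))
label = format_seconds(elapsed)
print(label)
```

The formatted string could then be placed in the comparison table in place of a hand-typed value.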


### Conclusion
To check for overfitting, I evaluated every model on a held-out test set and compared accuracy, precision, and recall between the full and PCA-reduced datasets. On the full dataset, both models scored 1.0 on all three metrics, so the test results give no indication of overfitting. On the PCA-reduced dataset, RandomForest still scored 1.0, while LogisticRegression dropped slightly to about 0.995 on each metric. The effect on training time was mixed: LogisticRegression trained faster on the reduced data (about 97.5 ms versus about 289 ms), but RandomForest trained more slowly (about 3.32 s versus about 502 ms). In this case, PCA traded a small loss in LogisticRegression performance for faster training and offered no benefit for RandomForest.