# Mushroom Classification Project

**Name:** David Rodriguez-Mayorquin  
**Date:** March 22, 2025  

## Introduction
This project applies classification modeling techniques to the Mushroom dataset from the UCI Machine Learning Repository. The goal is to predict whether a mushroom is edible or poisonous based on its physical characteristics.

Classification models are commonly used in fields such as healthcare, finance, and business analytics to support decision-making. In this project, I will:
- Load and inspect the dataset
- Analyze feature distributions
- Encode features appropriately
- Train and evaluate multiple classification models
- Compare performance metrics and draw conclusions

The dataset contains 8,124 mushroom samples described by 22 categorical attributes and a binary target variable, `poisonous`, which takes the values `e` (edible) and `p` (poisonous).


## Section 1: Import and Inspect the Data
### 1.1 Load the Dataset and Display the First 10 Rows

In [5]:
# Import libraries
from ucimlrepo import fetch_ucirepo
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Machine Learning
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Plot settings
sns.set(style="whitegrid")
plt.rcParams["figure.figsize"] = (10, 6)

# Fetch the dataset directly from UCI
mushroom = fetch_ucirepo(id=73)  # Mushroom dataset

# Separate features and target
X = mushroom.data.features
y = mushroom.data.targets

# Combine X and y into one DataFrame for inspection
df = pd.concat([X, y], axis=1)

# Display first 10 rows
df.head(10)


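| | cap-shape | cap-surface | cap-color | bruises | odor | gill-attachment | gill-spacing | gill-size | gill-color | stalk-shape | ... | stalk-color-above-ring | stalk-color-below-ring | veil-type | veil-color | ring-number | ring-type | spore-print-color | population | habitat | poisonous |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | x | s | n | t | p | f | c | n | k | e | ... | w | w | p | w | o | p | k | s | u | p |
| 1 | x | s | y | t | a | f | c | b | k | e | ... | w | w | p | w | o | p | n | n | g | e |
| 2 | b | s | w | t | l | f | c | b | n | e | ... | w | w | p | w | o | p | n | n | m | e |
| 3 | x | y | w | t | p | f | c | n | n | e | ... | w | w | p | w | o | p | k | s | u | p |
| 4 | x | s | g | f | n | f | w | b | k | t | ... | w | w | p | w | o | e | n | a | g | e |
| 5 | x | y | y | t | a | f | c | b | n | e | ... | w | w | p | w | o | p | k | n | g | e |
| 6 | b | s | w | t | a | f | c | b | g | e | ... | w | w | p | w | o | p | k | n | m | e |
| 7 | b | y | w | t | l | f | c | b | n | e | ... | w | w | p | w | o | p | n | s | m | e |
| 8 | x | y | w | t | p | f | c | n | p | e | ... | w | w | p | w | o | p | k | v | g | p |
| 9 | b | s | y | t | a | f | c | b | g | e | ... | w | w | p | w | o | p | k | s | m | e |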


### 1.2 Check for Missing Values and Display Summary Statistics

In [6]:
# Check for missing values
print("Missing values per column:\n")
print(df.isnull().sum())

# Summary statistics for all columns (categorical-friendly)
df.describe(include='all')

Missing values per column:

cap-shape                      0
cap-surface                    0
cap-color                      0
bruises                        0
odor                           0
gill-attachment                0
gill-spacing                   0
gill-size                      0
gill-color                     0
stalk-shape                    0
stalk-root                  2480
stalk-surface-above-ring       0
stalk-surface-below-ring       0
stalk-color-above-ring         0
stalk-color-below-ring         0
veil-type                      0
veil-color                     0
ring-number                    0
ring-type                      0
spore-print-color              0
population                     0
habitat                        0
poisonous                      0
dtype: int64


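| | cap-shape | cap-surface | cap-color | bruises | odor | gill-attachment | gill-spacing | gill-size | gill-color | stalk-shape | ... | stalk-color-above-ring | stalk-color-below-ring | veil-type | veil-color | ring-number | ring-type | spore-print-color | population | habitat | poisonous |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | ... | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 |
| unique | 6 | 4 | 10 | 2 | 9 | 2 | 2 | 2 | 12 | 2 | ... | 9 | 9 | 1 | 4 | 3 | 5 | 9 | 6 | 7 | 2 |
| top | x | y | n | f | n | f | c | b | b | t | ... | w | w | p | w | o | p | w | v | d | e |
| freq | 3656 | 3244 | 2284 | 4748 | 3528 | 7914 | 6812 | 5612 | 1728 | 4608 | ... | 4464 | 4384 | 8124 | 7924 | 7488 | 3968 | 2388 | 4040 | 3148 | 4208 |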
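
Raw missing-value counts are easier to judge as a share of rows. The following is a minimal sketch, assuming a small hypothetical DataFrame named `toy` whose values are made up for illustration and are not rows from the dataset. It computes the percentage of missing values per column and lists the columns that hold a single distinct value.

```python
import pandas as pd

# Toy stand-in for the mushroom DataFrame (hypothetical values, not real rows)
toy = pd.DataFrame({
    "odor": ["p", "a", "l", "n"],
    "stalk-root": ["e", None, "c", None],
    "veil-type": ["p", "p", "p", "p"],
})

# Share of missing values per column, as a percentage of all rows
missing_pct = toy.isnull().mean() * 100

# Columns with only one distinct (non-missing) value cannot help separate classes
constant_cols = [col for col in toy.columns if toy[col].nunique(dropna=True) == 1]

print(missing_pct.round(1))
print("Constant columns:", constant_cols)
```

Applied to `df`, the same expressions should report roughly 30.5% missing for `stalk-root` (2,480 of 8,124 rows) and flag `veil-type`, which the summary table shows with one unique value.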


## Reflection 1.

- The dataset is mostly clean: 22 of the 23 columns have no missing values.
- The column `stalk-root` has **2,480 missing values** (about 30.5% of the 8,124 rows). This must be handled before model training, either by imputing the missing entries or by dropping the column.
- All features are categorical, and some, such as `odor`, `gill-size`, `ring-type`, and `spore-print-color`, appear likely to be informative.
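
The two options named above for `stalk-root`, dropping the column or imputing it, can be sketched as follows. This is an illustration on a hypothetical `toy` DataFrame with made-up values, not necessarily the approach this project takes. The third variant, which keeps missingness as its own category under the placeholder label `"missing"`, is an added assumption and one possible form of imputation.

```python
import pandas as pd

# Toy stand-in for the mushroom DataFrame (hypothetical values, not real rows)
toy = pd.DataFrame({
    "odor": ["p", "a", "l", "n", "n"],
    "stalk-root": ["e", None, "c", None, "e"],
    "poisonous": ["p", "e", "e", "e", "e"],
})

# Option 1: drop the column that contains missing values
dropped = toy.drop(columns=["stalk-root"])

# Option 2: impute with the most frequent category (the mode ignores missing values)
most_common = toy["stalk-root"].mode()[0]
imputed = toy.copy()
imputed["stalk-root"] = imputed["stalk-root"].fillna(most_common)

# Variant (assumption): treat "missing" as a category of its own
flagged = toy.copy()
flagged["stalk-root"] = flagged["stalk-root"].fillna("missing")

print(dropped.columns.tolist())
print(imputed["stalk-root"].tolist())
print(flagged["stalk-root"].tolist())
```

Dropping is the simplest choice but discards whatever signal the column holds. Mode imputation keeps the column but hides which rows were originally missing, while a separate category preserves that information.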