

# Midterm Project: Classification Analysis

**Author:** Anjana Dhakal

**Date:** 11/04/2025

**Objectives:**

- Explore Data: Load the UCI Mushroom Dataset, inspect it for issues, and visualize distributions to identify edibility patterns.
- Prepare Data: Clean missing values, encode categorical features, and select key features.
- Train & Evaluate: Split the data, build a Logistic Regression baseline, and assess it with accuracy, F1, and a confusion matrix.
- Improve & Compare: Add a Random Forest, benchmark its metrics against the baseline, and explain the differences.
- Reflect: Summarize insights, challenges, next steps, and real-world applications.



## Introduction
This project classifies mushrooms as edible or poisonous using the UCI Mushroom Dataset. The dataset contains 8,124 instances described by 22 categorical attributes, such as cap shape, odor, and habitat. The target variable is binary: 'e' for edible and 'p' for poisonous. The goal is to build and evaluate classification models that predict edibility, which could support automated identification for safer foraging or for research. The project serves as a practical example of machine learning for decision-making in biology and environmental science.

The analysis uses pandas for data handling, NumPy for numerical computation, Matplotlib and seaborn for visualization, and scikit-learn for modeling.

## Section 1. Import and Inspect the Data

In [5]:
import pandas as pd
import os
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, StratifiedShuffleSplit
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report
from sklearn.feature_selection import SelectKBest, chi2
import warnings

# Suppress library warnings to keep the notebook output readable
warnings.filterwarnings('ignore')

# Set style for plots
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)

In [4]:
import pandas as pd
import os

# URL of the .data file
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data"

# Column names: the target ("class") followed by the 22 categorical attributes
columns = [
    "class","cap-shape","cap-surface","cap-color","bruises","odor",
    "gill-attachment","gill-spacing","gill-size","gill-color","stalk-shape",
    "stalk-root","stalk-surface-above-ring","stalk-surface-below-ring",
    "stalk-color-above-ring","stalk-color-below-ring","veil-type","veil-color",
    "ring-number","ring-type","spore-print-color","population","habitat"
]

# Read the headerless .data file, assigning the column names defined above
df = pd.read_csv(url, names=columns)

# Local data folder (absolute Windows path; change this to match your machine)
data_folder = r"C:\Repos\ml_classification_anjana\data"

# Make sure the folder exists
os.makedirs(data_folder, exist_ok=True)

# Save CSV inside the data folder
csv_path = os.path.join(data_folder, "mushroom.csv")
df.to_csv(csv_path, index=False)

print(f"Saved CSV to '{csv_path}'")


Saved CSV to 'C:\Repos\ml_classification_anjana\data\mushroom.csv'
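
With the file saved, a quick structural check helps confirm that the data loaded as expected. The sketch below is one possible way to do this, not part of the original notebook: `summarize_categorical` is a hypothetical helper, and `sample` is a tiny made-up frame standing in for `df` so the example runs without a download. It assumes missing values are coded as `?`, as the UCI documentation describes for this dataset; `pandas.read_csv` does not treat `?` as missing by default, so such entries would not show up in `df.isnull()`.

```python
import pandas as pd


def summarize_categorical(frame: pd.DataFrame, missing_marker: str = "?") -> pd.DataFrame:
    """Return, per column, the number of distinct values and the number of
    cells equal to the missing-value marker (hypothetical helper)."""
    return pd.DataFrame({
        "n_unique": frame.nunique(),
        "n_missing_marker": (frame == missing_marker).sum(),
    })


# Made-up rows in the dataset's single-letter style (NOT real data);
# in the notebook, pass the loaded `df` instead.
sample = pd.DataFrame({
    "class": ["e", "p", "e", "p"],
    "odor": ["n", "f", "a", "f"],
    "stalk-root": ["b", "?", "c", "?"],
})

summary = summarize_categorical(sample)

# Share of each class label, useful for spotting class imbalance
class_share = sample["class"].value_counts(normalize=True)

print(sample.shape)
print(summary)
print(class_share)
```

Calling `summarize_categorical(df)` on the full dataset would show which columns contain the `?` marker and whether any column has only a single distinct value, both of which matter for the cleaning and feature-selection steps.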
