# Mushroom Classification — Improved Logistic Regression Baseline

This notebook trains a logistic regression classifier on a stratified 80/20 train/test split with `class_weight='balanced'`, then evaluates it with stratified 5-fold cross-validation.
It covers preprocessing (one-hot encoding), model training, evaluation, and visualizations (confusion matrix and ROC curve).
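
To make the effect of `class_weight='balanced'` concrete, here is a minimal sketch of how the weights are derived. The label vector `y_demo` is a hypothetical example (6 edible, 4 poisonous), not the notebook's data; `compute_class_weight` is scikit-learn's helper for the same calculation.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical label vector: 6 edible ('e') and 4 poisonous ('p') samples.
y_demo = np.array(["e"] * 6 + ["p"] * 4)
classes, counts = np.unique(y_demo, return_counts=True)

# 'balanced' weights follow n_samples / (n_classes * count_per_class),
# so the rarer class receives the larger weight.
manual = len(y_demo) / (len(classes) * counts)

# scikit-learn's helper computes the same values.
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_demo)

print(dict(zip(classes, weights)))
```

With these counts the minority class `p` is weighted 1.25 and the majority class `e` about 0.83, so errors on either class contribute roughly equally to the loss.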


In [None]:
import pandas as pd
df = pd.read_csv("mushrooms_sample.csv")
df.head()


In [None]:
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, classification_report
import numpy as np

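# Separate the features from the target label ('e' = edible, 'p' = poisonous).
X = df.drop(columns=['class'])
y = df['class']

# One-hot encode every categorical column; categories unseen during fit are encoded as all zeros.
cat_cols = X.select_dtypes(include=['object']).columns.tolist()
# `sparse` was renamed to `sparse_output` in scikit-learn 1.2 and removed in 1.4.
pre = ColumnTransformer(transformers=[('cat', OneHotEncoder(handle_unknown='ignore', sparse_output=False), cat_cols)])
model = Pipeline(steps=[('pre', pre), ('clf', LogisticRegression(max_iter=2000, class_weight='balanced', solver='lbfgs'))])

# Hold out 20% of the rows, preserving the class ratio in both splits.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
model.fit(X_train, y_train)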

train_preds = model.predict(X_train)
test_preds = model.predict(X_test)

train_acc = accuracy_score(y_train, train_preds)
test_acc = accuracy_score(y_test, test_preds)
f1 = f1_score(y_test, test_preds, average='weighted')
print("Train Accuracy:", train_acc)
print("Test Accuracy:", test_acc)
print("F1 Score (Test, weighted):", f1)

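# cross_val_score clones and refits the whole pipeline on each fold,
# so the encoder is fit only on that fold's training rows.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_acc_scores = cross_val_score(model, X, y, cv=cv, scoring='accuracy')
cv_f1_scores = cross_val_score(model, X, y, cv=cv, scoring='f1_weighted')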
print("\nCV Accuracy scores:", cv_acc_scores)
print("Mean CV Accuracy:", np.mean(cv_acc_scores), "Std:", np.std(cv_acc_scores))
print("\nCV F1-weighted scores:", cv_f1_scores)
print("Mean CV F1-weighted:", np.mean(cv_f1_scores), "Std:", np.std(cv_f1_scores))

print("\nClassification Report (Test):\n", classification_report(y_test, test_preds))
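
The two `cross_val_score` calls above fit the pipeline twice per fold, once for each metric. One possible alternative, sketched here under assumptions, is `cross_validate`, which scores several metrics from a single fit per fold. The frame `toy` and its column values are hypothetical stand-ins for `df`, used only so the sketch runs on its own.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy frame standing in for df; names and values are illustrative only.
toy = pd.DataFrame({
    "odor": ["n", "f"] * 10,
    "cap-shape": ["x", "b", "f", "x"] * 5,
    "class": ["e", "p"] * 10,
})
X_toy, y_toy = toy.drop(columns=["class"]), toy["class"]

# Same pipeline shape as the notebook: one-hot encoding, then logistic regression.
encoder = ColumnTransformer([("cat", OneHotEncoder(handle_unknown="ignore"), list(X_toy.columns))])
pipe = Pipeline([
    ("pre", encoder),
    ("clf", LogisticRegression(max_iter=2000, class_weight="balanced")),
])

# One pass over the folds, two metrics per fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
results = cross_validate(pipe, X_toy, y_toy, cv=cv, scoring=["accuracy", "f1_weighted"])

for metric in ("accuracy", "f1_weighted"):
    scores = results[f"test_{metric}"]
    print(f"{metric}: mean={scores.mean():.4f} std={scores.std():.4f}")
```

The returned dictionary holds one array per metric under keys of the form `test_<metric>`, alongside the fit and score times.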


### Visualizations

Confusion Matrix:

![](confusion_matrix_stratified.png)

ROC Curve:

![](roc_curve_stratified.png)
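
The notebook does not show the code that produced these images. The following is a minimal sketch of one way such plots could be generated, not necessarily how the originals were made. The arrays `y_true` and `p_poisonous`, the helper `plot_diagnostics`, and the output file names are hypothetical; in the notebook the probabilities would presumably come from `model.predict_proba(X_test)`, using the column that matches `'p'` in `model.classes_`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score, roc_curve

# Hypothetical stand-ins for y_test and the predicted probability of class 'p'.
y_true = np.array(["e", "p", "p", "e", "p", "e"])
p_poisonous = np.array([0.10, 0.80, 0.65, 0.30, 0.90, 0.40])

# Hard labels at a 0.5 threshold; rows of cm are true classes, columns are predictions.
y_pred = np.where(p_poisonous >= 0.5, "p", "e")
cm = confusion_matrix(y_true, y_pred, labels=["e", "p"])

# ROC curve and AUC, treating 'p' (poisonous) as the positive class.
fpr, tpr, thresholds = roc_curve(y_true, p_poisonous, pos_label="p")
auc = roc_auc_score((y_true == "p").astype(int), p_poisonous)


def plot_diagnostics(cm, fpr, tpr, auc, prefix="demo"):
    """Save both plots as PNG files (hypothetical helper; requires matplotlib)."""
    import matplotlib.pyplot as plt
    from sklearn.metrics import ConfusionMatrixDisplay

    ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=["e", "p"]).plot()
    plt.savefig(f"{prefix}_confusion_matrix.png")
    plt.close()

    fig, ax = plt.subplots()
    ax.plot(fpr, tpr, label=f"AUC = {auc:.2f}")
    ax.plot([0, 1], [0, 1], linestyle="--", label="chance")
    ax.set_xlabel("False positive rate")
    ax.set_ylabel("True positive rate")
    ax.legend()
    fig.savefig(f"{prefix}_roc_curve.png")
    plt.close(fig)
```

Calling `plot_diagnostics(cm, fpr, tpr, auc)` writes `demo_confusion_matrix.png` and `demo_roc_curve.png` to the working directory.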



### Results

Mushroom Classification - Logistic Regression (Improved Baseline)  
  
Train Accuracy: 0.875000  
Test Accuracy: 1.000000  
F1 Score (Test, weighted): 1.000000  
  
Cross-validation (5-fold Stratified):  
Accuracy scores: [0.5, 1. , 1. , 1. , 1. ]  
Mean CV Accuracy: 0.9000  Std: 0.2000  
  
F1 (weighted) CV scores: [0.6667, 1.    , 1.    , 1.    , 1.    ]  
Mean CV F1-weighted: 0.9333  Std: 0.1333  
  
Classification Report (Test):  

| | precision | recall | f1-score | support |
|---|---|---|---|---|
| e | 1.00 | 1.00 | 1.00 | 1 |
| p | 1.00 | 1.00 | 1.00 | 1 |
| accuracy | | | 1.00 | 2 |
| macro avg | 1.00 | 1.00 | 1.00 | 2 |
| weighted avg | 1.00 | 1.00 | 1.00 | 2 |
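
As a quick check on the cross-validation summary, the sketch below recomputes the reported means and standard deviations from the fold scores listed above. It also shows that the reported `Std` values match NumPy's default population standard deviation (`ddof=0`). The variable names are illustrative. Note that the support column shows a test set of only two samples, so each individual score is coarse.

```python
import numpy as np

# Fold scores copied from the results above.
cv_acc = np.array([0.5, 1.0, 1.0, 1.0, 1.0])
cv_f1 = np.array([0.6667, 1.0, 1.0, 1.0, 1.0])

# np.std defaults to the population standard deviation (ddof=0).
acc_mean, acc_std = cv_acc.mean(), cv_acc.std()
f1_mean, f1_std = cv_f1.mean(), cv_f1.std()

# The sample standard deviation (ddof=1) is larger with only five folds.
acc_std_sample = cv_acc.std(ddof=1)

print(f"accuracy:    mean={acc_mean:.4f} std={acc_std:.4f} (ddof=1: {acc_std_sample:.4f})")
print(f"f1_weighted: mean={f1_mean:.4f} std={f1_std:.4f}")
```

The single fold at 0.5 accounts for the entire spread, which is why the standard deviation is large relative to the mean.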
  
