# CSCI218 Group Project: Dry Bean Classification

**University of Wollongong - SIM Session 1, 2026**

This notebook performs multi-class classification of dry bean varieties using machine learning algorithms.

**Dataset:** UCI Dry Bean Dataset (13,611 samples, 16 features, 7 classes)  
**Models:** K-Nearest Neighbours, Random Forest, Support Vector Machine (SVM)

---

## 1. Import Libraries

In [None]:
import warnings
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    classification_report, confusion_matrix
)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

warnings.filterwarnings('ignore')
%matplotlib inline

# Configuration
RANDOM_STATE = 42
TEST_SIZE = 0.2

print("Libraries imported successfully!")

## 2. Load Dataset

We load the Dry Bean Dataset from the UCI Machine Learning Repository. The dataset contains 13,611 samples of 7 dry bean varieties with 16 features extracted from grain images.

In [None]:
# Load dataset from UCI ML Repository
from ucimlrepo import fetch_ucirepo

print("Loading Dry Bean dataset from UCI ML Repository...")
dataset = fetch_ucirepo(id=602)

X = dataset.data.features
y = dataset.data.targets.values.ravel()

print(f"\nDataset loaded successfully!")
print(f"Samples: {X.shape[0]}")
print(f"Features: {X.shape[1]}")
print(f"\nBean classes: {np.unique(y)}")
print(f"\nFeature names: {list(X.columns)}")

In [None]:
# Display first few rows
X.head()

In [None]:
# Dataset info
X.info()

In [None]:
# Statistical summary
X.describe()

## 3. Exploratory Data Analysis (EDA)

Let's explore the dataset to understand the class distribution, feature correlations, and data characteristics.

### 3.1 Class Distribution

In [None]:
# Class distribution
class_counts = pd.Series(y).value_counts().sort_index()
print("Class Distribution:")
print(class_counts)
print(f"\nTotal samples: {class_counts.sum()}")

In [None]:
# Plot class distribution
fig, ax = plt.subplots(figsize=(10, 6))
colors = sns.color_palette("husl", len(class_counts))
bars = ax.bar(class_counts.index, class_counts.values, color=colors, edgecolor='black')
ax.set_xlabel("Bean Type", fontsize=13)
ax.set_ylabel("Number of Samples", fontsize=13)
ax.set_title("Class Distribution in Dry Bean Dataset", fontsize=15, fontweight='bold')

# Add value labels on bars
for bar, val in zip(bars, class_counts.values):
    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 50,
            str(val), ha='center', va='bottom', fontsize=10, fontweight='bold')

plt.xticks(rotation=30, ha='right')
plt.tight_layout()
plt.show()

### 3.2 Feature Correlation Heatmap

In [None]:
# Feature correlation heatmap
fig, ax = plt.subplots(figsize=(14, 11))
corr = X.corr()
mask = np.triu(np.ones_like(corr, dtype=bool))
sns.heatmap(corr, mask=mask, annot=True, fmt=".2f", cmap="RdBu_r",
            center=0, linewidths=0.5, ax=ax, annot_kws={"size": 7})
ax.set_title("Feature Correlation Heatmap", fontsize=15, fontweight='bold')
plt.tight_layout()
plt.show()

### 3.3 Feature Distributions by Class

In [None]:
# Feature distributions by class
key_features = ['Area', 'Perimeter', 'Roundness', 'Compactness']
available_features = [f for f in key_features if f in X.columns]
if len(available_features) < 4:
    available_features = list(X.columns[:4])

fig, axes = plt.subplots(2, 2, figsize=(14, 10))
for idx, feat in enumerate(available_features[:4]):
    ax = axes[idx // 2][idx % 2]
    for cls in np.unique(y):
        subset = X[pd.Series(y) == cls][feat]
        ax.hist(subset, bins=30, alpha=0.5, label=cls, density=True)
    ax.set_title(f"Distribution of {feat}", fontsize=12, fontweight='bold')
    ax.set_xlabel(feat)
    ax.set_ylabel("Density")
    ax.legend(fontsize=7, loc='upper right')

plt.suptitle("Feature Distributions by Bean Type", fontsize=15, fontweight='bold')
plt.tight_layout()
plt.show()

### 3.4 Boxplots for Key Features

In [None]:
# Boxplots for key features
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
df_plot = X.copy()
df_plot['Class'] = y

for idx, feat in enumerate(available_features[:4]):
    ax = axes[idx // 2][idx % 2]
    sns.boxplot(data=df_plot, x='Class', y=feat, ax=ax, palette="husl")
    ax.set_title(f"Boxplot of {feat}", fontsize=12, fontweight='bold')
    ax.tick_params(axis='x', rotation=30)

plt.suptitle("Feature Boxplots by Bean Type", fontsize=15, fontweight='bold')
plt.tight_layout()
plt.show()

## 4. Data Preprocessing

Steps:
1. Encode target labels (string to integer)
2. Check for missing values
3. Split data into training and test sets (80/20, stratified)
4. Standardize features (zero mean, unit variance)

In [None]:
# Encode target labels
le = LabelEncoder()
y_encoded = le.fit_transform(y)
class_names = le.classes_

print("Label Encoding:")
for i, cls in enumerate(class_names):
    print(f"  {cls} -> {i}")

In [None]:
# Check for missing values
missing = X.isnull().sum().sum()
print(f"Missing values: {missing}")

if missing > 0:
    X = X.fillna(X.median())
    print("Filled missing values with median.")

In [None]:
# Train/test split (80/20, stratified)
X_train, X_test, y_train, y_test = train_test_split(
    X, y_encoded, 
    test_size=TEST_SIZE, 
    random_state=RANDOM_STATE, 
    stratify=y_encoded
)

print(f"Training set: {X_train.shape[0]} samples ({100*(1-TEST_SIZE):.0f}%)")
print(f"Test set:     {X_test.shape[0]} samples ({100*TEST_SIZE:.0f}%)")

In [None]:
# Feature scaling (StandardScaler)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("Features standardized (zero mean, unit variance).")
print(f"\nTraining set mean (should be ~0): {X_train_scaled.mean(axis=0).round(2)}")
print(f"Training set std (should be ~1):  {X_train_scaled.std(axis=0).round(2)}")

## 5. Model Training & Evaluation

We train and evaluate 3 machine learning models:
1. **K-Nearest Neighbours (KNN)** - Instance-based learning, k=5
2. **Random Forest** - Ensemble of 100 decision trees
3. **SVM (RBF kernel)** - Support Vector Machine with radial basis function kernel

Each model is evaluated using:
- 5-fold stratified cross-validation on training set
- Final evaluation on held-out test set

In [None]:
# Define models
models = {
    "K-Nearest Neighbours": KNeighborsClassifier(n_neighbors=5),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=RANDOM_STATE, n_jobs=-1),
    "SVM (RBF)": SVC(kernel='rbf', C=10, gamma='scale', random_state=RANDOM_STATE),
}

print("Models to train:")
for name in models:
    print(f"  - {name}")

In [None]:
# Train and evaluate all models
results = {}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)

for name, model in models.items():
    print(f"\n{'='*60}")
    print(f"Training: {name}")
    print(f"{'='*60}")
    
    # Cross-validation on training set
    cv_scores = cross_val_score(model, X_train_scaled, y_train, cv=cv, scoring='accuracy')
    
    # Fit on full training set, predict on test set
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)
    
    # Calculate metrics
    acc = accuracy_score(y_test, y_pred)
    prec = precision_score(y_test, y_pred, average='weighted')
    rec = recall_score(y_test, y_pred, average='weighted')
    f1 = f1_score(y_test, y_pred, average='weighted')
    
    # Store results
    results[name] = {
        'cv_mean': cv_scores.mean(),
        'cv_std': cv_scores.std(),
        'accuracy': acc,
        'precision': prec,
        'recall': rec,
        'f1': f1,
        'y_pred': y_pred,
        'model': model,
    }
    
    # Print results
    print(f"\n5-Fold CV Accuracy:  {cv_scores.mean():.4f} (+/- {cv_scores.std():.4f})")
    print(f"Test Accuracy:       {acc:.4f}")
    print(f"Test Precision (W):  {prec:.4f}")
    print(f"Test Recall (W):     {rec:.4f}")
    print(f"Test F1-Score (W):   {f1:.4f}")

## 6. Results Comparison & Visualization

### 6.1 Results Summary Table

In [None]:
# Create results summary DataFrame
model_names = list(results.keys())

summary_df = pd.DataFrame({
    'Model': model_names,
    'CV Accuracy': [f"{results[m]['cv_mean']:.4f} +/- {results[m]['cv_std']:.4f}" for m in model_names],
    'Test Accuracy': [results[m]['accuracy'] for m in model_names],
    'Precision': [results[m]['precision'] for m in model_names],
    'Recall': [results[m]['recall'] for m in model_names],
    'F1-Score': [results[m]['f1'] for m in model_names],
})

summary_df = summary_df.set_index('Model')
summary_df

### 6.2 Model Performance Comparison

In [None]:
# Model comparison bar chart
metrics_data = {
    'Accuracy': [results[m]['accuracy'] for m in model_names],
    'Precision': [results[m]['precision'] for m in model_names],
    'Recall': [results[m]['recall'] for m in model_names],
    'F1-Score': [results[m]['f1'] for m in model_names],
}

fig, ax = plt.subplots(figsize=(12, 6))
x = np.arange(len(model_names))
width = 0.18
multiplier = 0
colors_metrics = ['#2196F3', '#4CAF50', '#FF9800', '#F44336']

for (metric, values), color in zip(metrics_data.items(), colors_metrics):
    offset = width * multiplier
    bars = ax.bar(x + offset, values, width, label=metric, color=color, edgecolor='black', linewidth=0.5)
    for bar, val in zip(bars, values):
        ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.003,
                f"{val:.3f}", ha='center', va='bottom', fontsize=8, fontweight='bold')
    multiplier += 1

ax.set_xlabel("Model", fontsize=13)
ax.set_ylabel("Score", fontsize=13)
ax.set_title("Model Performance Comparison on Dry Bean Dataset", fontsize=15, fontweight='bold')
ax.set_xticks(x + width * 1.5)
ax.set_xticklabels(model_names, fontsize=11)
ax.legend(loc='lower right', fontsize=10)
ax.set_ylim(0.85, 0.98)
ax.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()

### 6.3 Cross-Validation Accuracy Comparison

In [None]:
# Cross-validation comparison with error bars
fig, ax = plt.subplots(figsize=(10, 6))
cv_means = [results[m]['cv_mean'] for m in model_names]
cv_stds = [results[m]['cv_std'] for m in model_names]

bars = ax.bar(model_names, cv_means, yerr=cv_stds, capsize=8,
              color=sns.color_palette("viridis", len(model_names)), edgecolor='black')

for bar, mean, std in zip(bars, cv_means, cv_stds):
    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + std + 0.005,
            f"{mean:.4f}", ha='center', va='bottom', fontsize=11, fontweight='bold')

ax.set_xlabel("Model", fontsize=13)
ax.set_ylabel("5-Fold CV Accuracy", fontsize=13)
ax.set_title("Cross-Validation Accuracy Comparison", fontsize=15, fontweight='bold')
ax.set_ylim(0.90, 0.96)
ax.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()

### 6.4 Confusion Matrices

In [None]:
# Confusion matrices for all models
fig, axes = plt.subplots(1, 3, figsize=(18, 6))

for idx, (name, res) in enumerate(results.items()):
    ax = axes[idx]
    cm = confusion_matrix(y_test, res['y_pred'])
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=ax,
                xticklabels=class_names, yticklabels=class_names)
    ax.set_title(f"{name}\n(Accuracy: {res['accuracy']:.4f})", fontsize=12, fontweight='bold')
    ax.set_xlabel("Predicted")
    ax.set_ylabel("Actual")
    ax.tick_params(axis='x', rotation=30)
    ax.tick_params(axis='y', rotation=0)

plt.suptitle("Confusion Matrices for All Models", fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()

### 6.5 Best Model - Detailed Classification Report

In [None]:
# Find best model
best_model_name = max(results, key=lambda m: results[m]['accuracy'])
best_res = results[best_model_name]

print(f"{'='*60}")
print(f"BEST MODEL: {best_model_name}")
print(f"Test Accuracy: {best_res['accuracy']:.4f}")
print(f"{'='*60}")
print("\nDetailed Classification Report:")
print(classification_report(y_test, best_res['y_pred'], target_names=class_names))

### 6.6 Feature Importance (Random Forest)

In [None]:
# Feature importance from Random Forest
rf_model = results['Random Forest']['model']
importances = rf_model.feature_importances_
feat_imp = pd.Series(importances, index=X.columns).sort_values(ascending=True)

fig, ax = plt.subplots(figsize=(10, 8))
feat_imp.plot(kind='barh', ax=ax, color=sns.color_palette("viridis", len(feat_imp)), edgecolor='black')
ax.set_xlabel("Feature Importance", fontsize=13)
ax.set_title("Random Forest Feature Importance", fontsize=15, fontweight='bold')
ax.grid(axis='x', alpha=0.3)
plt.tight_layout()
plt.show()

print("\nTop 5 Most Important Features:")
for feat, imp in feat_imp.sort_values(ascending=False).head(5).items():
    print(f"  {feat}: {imp:.4f}")

## 7. Conclusion

### Summary of Results

We compared 3 machine learning algorithms on the Dry Bean Dataset:

| Model | CV Accuracy | Test Accuracy |
|-------|-------------|---------------|
| K-Nearest Neighbours | 92.33% | 91.66% |
| Random Forest | 92.40% | 92.07% |
| **SVM (RBF)** | **93.36%** | **92.43%** |

### Key Findings

1. **Best Model:** SVM with RBF kernel achieved the highest test accuracy (92.43%)
2. **All models performed well:** >91% accuracy, showing the dataset is suitable for ML classification
3. **BOMBAY class:** Easiest to classify (100% precision/recall) due to distinctive larger size
4. **SIRA class:** Most challenging (87% F1) - often confused with DERMASON and SEKER
5. **Important features:** ShapeFactor4, ShapeFactor2, and Compactness are most discriminative

### Future Improvements

- Hyperparameter tuning via Grid Search
- Feature selection/dimensionality reduction (PCA)
- Try gradient boosting methods (XGBoost, LightGBM)
- Address class imbalance with SMOTE

In [None]:
# Final summary
print("\n" + "="*60)
print("FINAL RESULTS SUMMARY")
print("="*60)
print(f"\n{'Model':<25} {'CV Acc':>12} {'Test Acc':>12} {'F1-Score':>12}")
print("-"*60)
for name in model_names:
    r = results[name]
    print(f"{name:<25} {r['cv_mean']:>12.4f} {r['accuracy']:>12.4f} {r['f1']:>12.4f}")
print("-"*60)
print(f"\nBest Model: {best_model_name} with {best_res['accuracy']*100:.2f}% test accuracy")