# Empirical Analysis of Supervised Learning Algorithms Across Diverse Data Domains #
**Author:** Anthony Mitine \
**Course:** COGS 118A. Supervised Machine Learning Algorithms\
**Date:** December 11, 2025 

## Overview ##

In this study, I empirically evaluated five supervised learning classifiers: Random Forest, XGBoost, Support Vector Machines (SVM), Logistic Regression, and K-Nearest Neighbors (KNN). I ran them across four datasets of varying dimensionality and domain (Adult, Letter Recognition, Spambase, and Mushroom). Following the experimental protocol established by Caruana and Niculescu-Mizil, I analyzed the impact of training set size on generalization error. My results indicate that ensemble methods (XGBoost and Random Forest) outperformed single estimators on complex tabular data, while geometric methods (SVM and KNN) excelled in spatial domains like character recognition. Furthermore, I found that increasing training data volume yields diminishing returns, with the largest gains observed in lower-data regimes.

**Link to Report**: https://drive.google.com/file/d/1PMA40ckfMYAHLKYzF6gttpJsSe3N4vKU/view?usp=sharing

In [1]:
# Importing all needed libraries
import warnings
# filter out the specific threadpoolctl warning
warnings.filterwarnings("ignore", category=RuntimeWarning, module="threadpoolctl")

import pandas as pd
import numpy as np
import time
import matplotlib.pyplot as plt
import seaborn as sns
from ucimlrepo import fetch_ucirepo
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

In [2]:
# Importing all needed classifiers
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

## Experiment Configuration

Here I define the experimental configuration. I select four datasets from the UCI repository (Adult, Letter, Spambase, Mushroom) to ensure diversity in problem domain, along with three training partitions and the number of trials per setting. I also initialize the five classifiers (Random Forest, XGBoost, SVM, Logistic Regression, KNN) and define the hyperparameter search grids used during cross-validation.

In [3]:
# Define the 4 datasets to use (Name, UCI ID)

datasets_config = {
    'adult': 2,       
    'letter': 59,     
    'spambase': 94,   
    'mushroom': 73    
}

# Partitions and Trials
partitions = [0.2, 0.5, 0.8] 
num_trials = 3 

# 5 classifiers 
model_param_grids = {
    'random_forest': {
        'model': RandomForestClassifier(n_jobs=-1, random_state=42),
        'params': {
            'classifier__n_estimators': [50, 100],
            'classifier__max_depth': [None, 10, 20],
            'classifier__min_samples_split': [2, 10]
        }
    },
    'xgboost': {
        'model': XGBClassifier(eval_metric='logloss', n_jobs=-1, random_state=42),
        'params': {
            'classifier__n_estimators': [50, 100],
            'classifier__learning_rate': [0.1, 0.3],
            'classifier__max_depth': [3, 6]
        }
    },
    'svm': {
        'model': SVC(random_state=42, cache_size=1000),
        'params': {
            'classifier__C': [0.1, 1, 10],
            'classifier__kernel': ['rbf'] 
        }
    },
    'logistic_regression': {
        'model': LogisticRegression(max_iter=1000, random_state=42),
        'params': {
            'classifier__C': [0.1, 1, 10],
            'classifier__solver': ['lbfgs', 'liblinear']
        }
    },
    'knn': {
        'model': KNeighborsClassifier(n_jobs=-1),
        'params': {
            'classifier__n_neighbors': [3, 5, 9],
            'classifier__weights': ['uniform', 'distance']
        }
    }
}
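
To gauge how much of each grid a small randomized search can cover, the sketch below counts the candidate combinations per classifier with scikit-learn's `ParameterGrid`. The grids are copied from the configuration above, and `n_iter = 3` mirrors the budget passed to `RandomizedSearchCV` in the main loop. The names `param_grids` and `grid_sizes` are illustrative and not part of the notebook.

```python
from sklearn.model_selection import ParameterGrid

# Search grids copied from the configuration cell (estimators omitted).
param_grids = {
    'random_forest': {
        'classifier__n_estimators': [50, 100],
        'classifier__max_depth': [None, 10, 20],
        'classifier__min_samples_split': [2, 10],
    },
    'xgboost': {
        'classifier__n_estimators': [50, 100],
        'classifier__learning_rate': [0.1, 0.3],
        'classifier__max_depth': [3, 6],
    },
    'svm': {
        'classifier__C': [0.1, 1, 10],
        'classifier__kernel': ['rbf'],
    },
    'logistic_regression': {
        'classifier__C': [0.1, 1, 10],
        'classifier__solver': ['lbfgs', 'liblinear'],
    },
    'knn': {
        'classifier__n_neighbors': [3, 5, 9],
        'classifier__weights': ['uniform', 'distance'],
    },
}

n_iter = 3  # search budget used by RandomizedSearchCV in the main loop

# Number of distinct hyperparameter combinations in each grid.
grid_sizes = {name: len(ParameterGrid(grid)) for name, grid in param_grids.items()}

for name, size in grid_sizes.items():
    sampled = min(n_iter, size)
    print(f"{name}: {size} combinations, {sampled} evaluated per search")
```

Under these settings only the SVM grid is covered exhaustively; the other searches evaluate a random subset of their grids.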

## Data Loading & Preprocessing

This function handles the extract, transform, and load (ETL) process. It fetches each dataset through the ucimlrepo API and applies dataset-specific preprocessing: converting targets into binary labels (e.g., Letter 'A' vs. not 'A'), dropping rows with missing values, and subsampling Adult to 10,000 rows to keep training time manageable.

In [4]:
# Data loading   

def load_and_prep_data(name, uci_id):
    print(f"\n[data load] fetching {name} (id: {uci_id})...")
    
    try:
        dataset = fetch_ucirepo(id=uci_id)
        x = dataset.data.features
        y = dataset.data.targets

        # Preprocessing logic 
        
        # 1. Adult dataset 
        if name == 'adult':
            y = y.iloc[:, 0].astype(str) 
            y = y.apply(lambda val: 1 if '>50K' in val else 0)
            combined = pd.concat([x, y], axis=1).dropna()
            combined = combined.replace('?', np.nan).dropna()
            if len(combined) > 10000:
                 combined = combined.sample(10000, random_state=42)
            x = combined.iloc[:, :-1]
            y = combined.iloc[:, -1]

        # 2. Letter recognition 
        elif name == 'letter':
            y = y.iloc[:, 0]
            y = (y == 'A').astype(int)

        # 3. Spambase 
        elif name == 'spambase':
            y = y.iloc[:, 0]
            # Ensure no nans
            combined = pd.concat([x, y], axis=1).dropna()
            x = combined.iloc[:, :-1]
            y = combined.iloc[:, -1]

        # 4. Mushroom 
        elif name == 'mushroom':
            y = y.iloc[:, 0]
            # target is 'e' (edible) or 'p' (poisonous). 
            y = (y == 'p').astype(int)
            
            combined = pd.concat([x, y], axis=1).replace('?', np.nan).dropna()
            x = combined.iloc[:, :-1]
            y = combined.iloc[:, -1]

        return x, y

    except Exception as e:
        print(f"error loading {name}: {e}")
        return None, None
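
As a quick illustration of the cleaning steps used for Mushroom, the sketch below applies the same binarize, replace `'?'`, and drop pattern to a small hand-made frame. The column names and values are hypothetical stand-ins, not real UCI rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for a UCI download; values are made up.
features = pd.DataFrame({
    'cap_shape': ['x', 'b', '?', 'x'],
    'odor': ['a', 'n', 'n', None],
})
target = pd.Series(['p', 'e', 'p', 'e'], name='poisonous')

# Binarize the target: 1 for poisonous ('p'), 0 for edible ('e').
y = (target == 'p').astype(int)

# Treat '?' as missing, then drop every row that has any missing value.
combined = pd.concat([features, y], axis=1).replace('?', np.nan).dropna()
x_clean = combined.iloc[:, :-1]
y_clean = combined.iloc[:, -1]

print(x_clean)
print(y_clean.tolist())
```

Joining features and target before calling `dropna` keeps the two aligned, so a row removed from the features is removed from the labels as well.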

## Pipeline Construction

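This function builds a scikit-learn pipeline that combines the preprocessing steps (standard scaling for numeric columns, one-hot encoding for categorical columns) with the chosen classifier. Because preprocessing lives inside the pipeline, its transformations are fit only on the training folds during cross-validation, which prevents leakage from held-out data.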

In [5]:
# Pipeline helper function

def get_pipeline(model_name, x):
    """
    creates a scikit-learn pipeline with preprocessing + classifier.
    """
    # Identify categorical and numerical columns
    cat_cols = x.select_dtypes(include=['object', 'category']).columns
    num_cols = x.select_dtypes(include=['int64', 'float64']).columns

    # Preprocessor: scale numeric data, encode categorical data
    preprocessor = ColumnTransformer(
        transformers=[
            ('num', StandardScaler(), num_cols),
            ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols)
        ]
    )

    # Combine preprocessor with the specific classifier
    clf = model_param_grids[model_name]['model']
    pipeline = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('classifier', clf)
    ])
    
    return pipeline
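
The sketch below shows, on hypothetical toy data, what fitting preprocessing inside the pipeline means in practice: the scaler's mean comes from the training rows alone, and a category that appears only at test time is encoded as zeros thanks to `handle_unknown='ignore'`. The column names, values, and the choice of `LogisticRegression` here are assumptions for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data; column names and values are made up.
x_train = pd.DataFrame({'age': [20.0, 30.0, 40.0, 50.0],
                        'job': ['a', 'b', 'a', 'b']})
y_train = pd.Series([0, 0, 1, 1])
x_test = pd.DataFrame({'age': [100.0], 'job': ['c']})  # 'c' is unseen in training

preprocessor = ColumnTransformer(transformers=[
    ('num', StandardScaler(), ['age']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['job']),
])
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(max_iter=1000)),
])
pipeline.fit(x_train, y_train)

# The scaler's statistics are computed from the training rows only.
fitted = pipeline.named_steps['preprocessor']
scaler_mean = fitted.named_transformers_['num'].mean_[0]
print(scaler_mean)

# The unseen category 'c' becomes an all-zero one-hot block.
encoded_test = fitted.transform(x_test)
if hasattr(encoded_test, 'toarray'):  # densify if a sparse matrix is returned
    encoded_test = encoded_test.toarray()
print(encoded_test)
```

The test row's extreme age does not shift the scaler's mean, which is the behavior the pipeline is meant to guarantee.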

## Main Loop

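The main loop iterates through all combinations of datasets, classifiers, and data partitions (20%, 50%, and 80% training splits). For each combination, it runs a randomized hyperparameter search (3 sampled configurations, each scored with 3-fold cross-validation on the training split), refits the best configuration on the full training split, and records training and test accuracy. Each combination is repeated over 3 trials with different random splits, and the accuracies are averaged to reduce split-to-split variance.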

In [6]:
# Main loop 

def run_experiments():
    results = []

    # Datasets loop: 
    for data_name, data_id in datasets_config.items():
        x, y = load_and_prep_data(data_name, data_id)
        if x is None:
            continue

        print(f"  dataset shape: {x.shape}")
        
        # Partitions loop (training size ratios)
        for train_size in partitions:
            test_size = 1.0 - train_size
            # round() avoids float truncation (int((1.0 - 0.8) * 100) evaluates to 19)
            print(f"\n  --- partition: train {round(train_size*100)}% / test {round(test_size*100)}% ---")
            
            # Classifiers loop
            for model_name in model_param_grids.keys():
                print(f"    running {model_name}...")
                
                trial_scores_test = []
                trial_scores_train = []
                
                # Trials loop (repeat 3 times)
                for trial in range(num_trials):
                    # 1. Split data
                    x_train, x_test, y_train, y_test = train_test_split(
                        x, y, train_size=train_size, random_state=42+trial, stratify=y
                    )
                    
                    # 2. Setup pipeline
                    pipeline = get_pipeline(model_name, x_train)
                    
                    # 3. Hyperparameter tuning (cross-validation on train set only)
                    param_grid = model_param_grids[model_name]['params']
                    search = RandomizedSearchCV(
                        pipeline, 
                        param_distributions=param_grid, 
                        n_iter=3,
                        cv=3, 
                        scoring='accuracy', 
                        n_jobs=-1,
                        random_state=42
                    )
                    
                    # 4. Train
                    search.fit(x_train, y_train)
                    
                    # 5. Evaluate (best model)
                    best_model = search.best_estimator_
                    
                    # Check train accuracy (sanity check)
                    train_pred = best_model.predict(x_train)
                    train_acc = accuracy_score(y_train, train_pred)
                    
                    # Check test accuracy (real metric)
                    test_pred = best_model.predict(x_test)
                    test_acc = accuracy_score(y_test, test_pred)
                    
                    trial_scores_train.append(train_acc)
                    trial_scores_test.append(test_acc)
                
                # Average results across trials
                avg_train = np.mean(trial_scores_train)
                avg_test = np.mean(trial_scores_test)
                
                print(f"      -> avg test acc: {avg_test:.4f} | avg train acc: {avg_train:.4f}")
                
                # Store for report
                results.append({
                    'dataset': data_name,
                    'partition_train_pct': round(train_size * 100),
                    'classifier': model_name,
                    'avg_train_acc': avg_train,
                    'avg_test_acc': avg_test,
                    'best_params': search.best_params_  # best parameters from the last trial only
                })

    # Report generation
    print("\n" + "="*60)
    print("final results summary")
    print("="*60)
    
    # Create dataframe for clean display
    results_df = pd.DataFrame(results)
    
    # Sort for easier reading
    results_df = results_df.sort_values(by=['dataset', 'partition_train_pct', 'classifier'])
    
    print(results_df[['dataset', 'partition_train_pct', 'classifier', 'avg_train_acc', 'avg_test_acc']].to_string(index=False))
    
    # Save to csv
    results_df.to_csv('experiment_results.csv', index=False)

# Execute the function
run_experiments()


[data load] fetching adult (id: 2)...
  dataset shape: (10000, 14)

  --- partition: train 20% / test 80% ---
    running random_forest...
      -> avg test acc: 0.8506 | avg train acc: 0.9185
    running xgboost...
      -> avg test acc: 0.8549 | avg train acc: 0.8758
    running svm...
      -> avg test acc: 0.8443 | avg train acc: 0.8727
    running logistic_regression...
      -> avg test acc: 0.8453 | avg train acc: 0.8517
    running knn...
      -> avg test acc: 0.8196 | avg train acc: 0.9998

  --- partition: train 50% / test 50% ---
    running random_forest...
      -> avg test acc: 0.8598 | avg train acc: 0.9099
    running xgboost...
      -> avg test acc: 0.8620 | avg train acc: 0.8797
    running svm...
      -> avg test acc: 0.8511 | avg train acc: 0.8727
    running logistic_regression...
      -> avg test acc: 0.8486 | avg train acc: 0.8560
    running knn...
      -> avg test acc: 0.8269 | avg train acc: 0.9999

  --- partition: train 80% / test 19% ---
    running r
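
The `test 19%` label in the log above is a display artifact rather than a different split: `1.0 - 0.8` is not exactly `0.2` in binary floating point, so truncating the percentage with `int` lands one below the intended value. A minimal sketch of the effect, using only built-in Python:

```python
train_size = 0.8
test_size = 1.0 - train_size  # slightly below 0.2 in binary floating point

truncated_pct = int(test_size * 100)   # truncates toward zero
rounded_pct = round(test_size * 100)   # rounds to the nearest integer

print(truncated_pct, rounded_pct)  # 19 20
```

The split passed to `train_test_split` is unaffected, since only the printed label depends on this conversion.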

## Experimental Results Summary ##
Running the pipeline produced the following performance rankings across the four datasets. Each figure is an average over three trials.

* **Adult Dataset**: The gradient-boosted ensemble (XGBoost) achieved the highest test accuracy, 86.7% with 80% training data, ahead of Random Forest (85.8%) and SVM (85.6%). This is consistent with the strength of boosting on heterogeneous tabular data.
  
* **Letter Recognition**: The Support Vector Machine (SVM) achieved near-perfect accuracy (99.9%) with 80% training data, marginally ahead of the tree-based ensembles (99.8%). This suggests that kernel methods are effective on the dataset's numeric, shape-derived features.

* **Spambase**: Random Forest led with 95.0% accuracy with 80% training data, outperforming the logistic regression baseline (92.1%) by about three percentage points. This suggests that bagging methods are effective on text-derived frequency features.

* **Mushroom**: All classifiers reached effectively 100% accuracy by the 50% training split, with Logistic Regression and Random Forest exceeding 99.9% even at the 20% split. This suggests the dataset contains near-deterministic rules that every model type captures easily.

* **General Trend**: Increasing the training partition from 20% to 80% generally improved accuracy, with diminishing returns: the gain from 20% to 50% was often larger than the gain from 50% to 80%.
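
The diminishing-returns pattern can be checked against the reported Adult numbers. The sketch below takes the 20% and 50% test accuracies from the run log and the 80% accuracies from the rounded figures quoted in this summary, then computes the gain between consecutive partitions. The frame layout and the names `adult` and `gains` are illustrative.

```python
import pandas as pd

# Adult test accuracies: 20% and 50% from the run log, 80% from the
# rounded percentages quoted in the summary.
adult = pd.DataFrame(
    {
        20: [0.8549, 0.8506, 0.8443],
        50: [0.8620, 0.8598, 0.8511],
        80: [0.867, 0.858, 0.856],
    },
    index=['xgboost', 'random_forest', 'svm'],
)

# Marginal gain between consecutive training partitions.
gains = adult.diff(axis=1).drop(columns=20)
gains.columns = ['gain_20_to_50', 'gain_50_to_80']

print(gains.round(4))
```

For all three classifiers the first step is the larger one. Because the 80% figures are rounded to one decimal place, the second-step gains are approximate.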