# Anand Mysorekar

## Cogs 118a Final Project

# Abstract

This project evaluates the performance of three classification models, Random Forest, Support Vector Machine (SVM), and Logistic Regression, on three datasets from the UCI Machine Learning Repository. Each dataset is structured for binary classification, and the models are trained using hyperparameter tuning and evaluated under varying data partitions (20/80, 50/50, 80/20 splits). Cross-validation accuracy and test accuracy are used to compare the models' effectiveness. Results show that Random Forest consistently achieves the highest accuracy across datasets, while SVM and Logistic Regression demonstrate varying strengths depending on the data's complexity and feature distribution. This study highlights the importance of selecting appropriate classifiers and hyperparameters for different datasets and provides insights into model behavior under different training-to-testing ratios.



# Introduction

Classification is a fundamental task in machine learning, where the goal is to assign inputs to predefined categories. Binary classification, in particular, has widespread applications such as identifying spam emails, diagnosing medical conditions, and predicting customer churn. Selecting the appropriate classification model is critical for achieving high accuracy and robustness across different datasets.

The performance of a classifier often depends on the dataset's characteristics, such as the number of features, the size of the training set, and the feature distributions. Comparing multiple classifiers on the same datasets provides insights into their strengths, weaknesses, and generalizability.

This project aims to evaluate and compare the performance of three popular classification models on three classification datasets. By exploring their behavior under different training/testing splits and tuning hyperparameters, this study seeks to identify general trends and best practices in classifier selection.

# Data Preprocessing

## Imports

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm.notebook import tqdm

from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, ConfusionMatrixDisplay

## Dataset 1: [Adult](http://archive.ics.uci.edu/dataset/2/adult)
Predicting whether an individual's annual income exceeds $50K/yr based on census data. 48,842 instances, 14 features.


### Load data and assign column names

In [None]:
column_names = [
    "age",
    "class",
    "fnlwgt", # drop
    "education_level",
    "education-num", # drop
    "marital_status",
    "occupation",
    "relationship",
    "race",
    "sex",
    "capital_gains",
    "capital_loss",
    "hours_per_week",
    "native_country",
    "salary"
]


adult_dataset = pd.read_csv('adult_dataset/adult.data', names=column_names) # read the dataset
adult_dataset = adult_dataset.drop(columns=["fnlwgt", "education-num"]) # drop these columns
adult_dataset = adult_dataset[~adult_dataset.map(lambda x: str(x).strip()).isin(['?']).any(axis=1)] # remove rows with missing values

print(adult_dataset.dtypes)
adult_dataset.head()

Most of the features are categorical, so we will need to encode them before training the models. Many of the features have lots of possible values, so we will group them into fewer categories to reduce the dimensionality of the data.
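To illustrate why grouping helps, the sketch below one-hot encodes a small column before and after grouping. The toy rows and the `toy_mapping` dictionary are assumptions made up for illustration; they are not taken from the Adult data.

```python
import pandas as pd

# Toy column with five distinct raw categories (illustrative values only)
toy = pd.DataFrame(
    {"workclass": ["Federal-gov", "Local-gov", "State-gov", "Private", "Self-emp-inc"]}
)

# Hypothetical grouping into three broader categories
toy_mapping = {
    "Federal-gov": "Government",
    "Local-gov": "Government",
    "State-gov": "Government",
    "Private": "Private",
    "Self-emp-inc": "Self-employed",
}

# One-hot encode the raw values and the grouped values
raw_encoded = pd.get_dummies(toy, columns=["workclass"])
grouped = toy.assign(workclass=toy["workclass"].map(toy_mapping))
grouped_encoded = pd.get_dummies(grouped, columns=["workclass"])

print(raw_encoded.shape[1], "columns before grouping")     # 5
print(grouped_encoded.shape[1], "columns after grouping")  # 3
```

Each distinct value becomes its own column under one-hot encoding, so merging similar values directly shrinks the feature matrix.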


### Group similar values to reduce dimensionality before encoding

In [None]:
class_mapping = {
    'Private': 'Private',
    'Self-emp-not-inc': 'Self-employed',
    'Self-emp-inc': 'Self-employed',
    'Federal-gov': 'Government',
    'Local-gov': 'Government',
    'State-gov': 'Government',
    'Without-pay': 'Other',
    'Never-worked': 'Other'
}

adult_dataset['class'] = adult_dataset['class'].str.strip()
adult_dataset['class'] = adult_dataset['class'].map(class_mapping)
print(adult_dataset['class'].unique())


education_mapping = {
    'Bachelors': 'Undergraduate',
    'Some-college': 'Undergraduate',
    'Assoc-acdm': 'Undergraduate',
    'Assoc-voc': 'Undergraduate',
    'Masters': 'Postgraduate',
    'Doctorate': 'Postgraduate',
    'Prof-school': 'Postgraduate',
    'HS-grad': 'Lower Education',
    '12th': 'Lower Education',
    '11th': 'Lower Education',
    '10th': 'Lower Education',
    '9th': 'Lower Education',
    '7th-8th': 'Lower Education',
    '5th-6th': 'Lower Education',
    '1st-4th': 'Lower Education',
    'Preschool': 'Lower Education'
}

adult_dataset['education_level'] = adult_dataset['education_level'].str.strip()
adult_dataset['education_level'] = adult_dataset['education_level'].map(education_mapping)
print(adult_dataset['education_level'].unique())


marital_mapping = {
    'Never-married': 'Single',
    'Married-civ-spouse': 'Married',
    'Married-AF-spouse': 'Married',
    'Divorced': 'Previously Married',
    'Separated': 'Previously Married',
    'Widowed': 'Previously Married',
    'Married-spouse-absent': 'Married'
}

adult_dataset['marital_status'] = adult_dataset['marital_status'].str.strip()
adult_dataset['marital_status'] = adult_dataset['marital_status'].map(marital_mapping)
print(adult_dataset['marital_status'].unique())


occupation_mapping = {
    'Tech-support': 'White-collar',
    'Craft-repair': 'Blue-collar',
    'Other-service': 'Service',
    'Sales': 'White-collar',
    'Exec-managerial': 'White-collar',
    'Prof-specialty': 'White-collar',
    'Handlers-cleaners': 'Blue-collar',
    'Machine-op-inspct': 'Blue-collar',
    'Adm-clerical': 'White-collar',
    'Farming-fishing': 'Blue-collar',
    'Transport-moving': 'Blue-collar',
    'Priv-house-serv': 'Service',
    'Protective-serv': 'Service',
    'Armed-Forces': 'Military'
}

adult_dataset['occupation'] = adult_dataset['occupation'].str.strip()
adult_dataset['occupation'] = adult_dataset['occupation'].map(occupation_mapping)
print(adult_dataset['occupation'].unique())


relationship_mapping = {
    'Wife': 'Spouse',
    'Husband': 'Spouse',
    'Own-child': 'Dependent',
    'Not-in-family': 'Unrelated',
    'Other-relative': 'Dependent',
    'Unmarried': 'Unrelated'
}

adult_dataset['relationship'] = adult_dataset['relationship'].str.strip()
adult_dataset['relationship'] = adult_dataset['relationship'].map(relationship_mapping)
print(adult_dataset['relationship'].unique())


race_mapping = {
    'White': 'White',
    'Black': 'Black',
    'Asian-Pac-Islander': 'Asian',
    'Amer-Indian-Eskimo': 'Indigenous',
    'Other': 'Other'
}

adult_dataset['race'] = adult_dataset['race'].str.strip()
adult_dataset['race'] = adult_dataset['race'].map(race_mapping)
print(adult_dataset['race'].unique())


sex_mapping = {
    'Male': 1,
    'Female': 0
}

adult_dataset['sex'] = adult_dataset['sex'].str.strip()
adult_dataset['sex'] = adult_dataset['sex'].map(sex_mapping)
print(adult_dataset['sex'].unique())


country_mapping = {
    'United-States': 'North America',
    'Canada': 'North America',
    'Outlying-US(Guam-USVI-etc)': 'North America',
    'Puerto-Rico': 'North America',
    'Mexico': 'Latin America',
    'Cuba': 'Latin America',
    'Dominican-Republic': 'Latin America',
    'Jamaica': 'Latin America',
    'Haiti': 'Latin America',
    'Trinadad&Tobago': 'Latin America',
    'El-Salvador': 'Latin America',
    'Guatemala': 'Latin America',
    'Honduras': 'Latin America',
    'Nicaragua': 'Latin America',
    'Ecuador': 'Latin America',
    'Peru': 'Latin America',
    'Columbia': 'Latin America',
    'England': 'Europe',
    'Germany': 'Europe',
    'Italy': 'Europe',
    'Poland': 'Europe',
    'Portugal': 'Europe',
    'Ireland': 'Europe',
    'France': 'Europe',
    'Greece': 'Europe',
    'Scotland': 'Europe',
    'Yugoslavia': 'Europe',
    'Hungary': 'Europe',
    'Holand-Netherlands': 'Europe',
    'Cambodia': 'Asia',
    'India': 'Asia',
    'Japan': 'Asia',
    'China': 'Asia',
    'Philippines': 'Asia',
    'Vietnam': 'Asia',
    'Laos': 'Asia',
    'Thailand': 'Asia',
    'Hong': 'Asia',
    'Taiwan': 'Asia',
    'Iran': 'Middle East',
    'South': 'Other',
    'Israel': 'Middle East',  
    'Other': 'Other'
}

adult_dataset['native_country'] = adult_dataset['native_country'].str.strip()
adult_dataset['native_country'] = adult_dataset['native_country'].map(country_mapping)
print(adult_dataset['native_country'].unique())


salary_mapping = {
    '>50K': 1,
    '<=50K': 0
}

adult_dataset['salary'] = adult_dataset['salary'].str.strip()
adult_dataset['salary'] = adult_dataset['salary'].map(salary_mapping)
print(adult_dataset['salary'].unique())

In [None]:
adult_dataset.head()

That looks much more manageable. Now we can encode the data. Most of the features are nominal, meaning they have no inherent order, so we will one-hot encode them. The education level variable, however, is ordinal (its values are ordered, but the distances between them are not necessarily meaningful), so we will encode it as integers instead.
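One caveat when integer-encoding an ordinal variable: scikit-learn's `LabelEncoder` assigns codes in sorted (alphabetical) label order, which need not match the intended ranking. The sketch below compares it with an explicit mapping; the `education_order` dictionary is an assumed ordering for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

levels = pd.Series(["Lower Education", "Undergraduate", "Postgraduate"])

# LabelEncoder sorts the labels, so the codes follow alphabetical order
alphabetical_codes = LabelEncoder().fit_transform(levels).tolist()

# Explicit mapping (assumed ordering) keeps Lower < Undergraduate < Postgraduate
education_order = {"Lower Education": 0, "Undergraduate": 1, "Postgraduate": 2}
ordinal_codes = levels.map(education_order).tolist()

print(alphabetical_codes)  # [0, 2, 1]
print(ordinal_codes)       # [0, 1, 2]
```

With the alphabetical codes, "Postgraduate" sits between the other two levels, which matters for models that treat the column as a numeric scale.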

### Encode categorical variables

In [None]:
columns_to_one_hot = ['class', 'marital_status', 'occupation', 'relationship', 'race', 'native_country']
adult_dataset = pd.get_dummies(adult_dataset, columns=columns_to_one_hot, prefix=columns_to_one_hot)

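# LabelEncoder would assign codes alphabetically (Lower Education=0, Postgraduate=1, Undergraduate=2),
# so map the levels explicitly to preserve their order
education_order = {'Lower Education': 0, 'Undergraduate': 1, 'Postgraduate': 2}
adult_dataset['education_level'] = adult_dataset['education_level'].map(education_order)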


In [None]:
print(adult_dataset.dtypes)
adult_dataset.head()


Now that each feature is in a numeric format that a model can interpret properly, the data is ready for modeling!

## Dataset 2: [Heart Disease](http://archive.ics.uci.edu/dataset/45/heart+disease)
Predicting whether a person has heart disease based on various features. The dataset combines four databases; three of them were riddled with missing values, so we will use only the Cleveland database. 303 instances, 14 attributes including the target.


### Load data and assign column names


In [None]:
column_names = ["age", "sex", "cp", "resting_bp", "cholesterol", "blood_sugar", "resting_ekg", "max_hr", "exang", "oldpeak", "slope", "ca", "thal", "severity"]


heart_dataset = pd.read_csv('heart_dataset/processed.cleveland.data', header=None, names=column_names) # read the dataset
heart_dataset = heart_dataset[~heart_dataset.map(lambda x: str(x).strip()).isin(['?']).any(axis=1)] # remove rows with missing values
heart_dataset['ca'] = heart_dataset['ca'].astype(float) # convert to float
heart_dataset['thal'] = heart_dataset['thal'].astype(float) # convert to float
heart_dataset['severity'] = heart_dataset['severity'].apply(lambda x: 0 if x == 0 else 1) # convert to binary for 2 class classification

print(heart_dataset.dtypes)
heart_dataset.head()

This dataset is already clean and ready for modeling, so we are done!
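As an alternative to filtering `'?'` strings after loading, pandas can treat the marker as missing at read time. The sketch below assumes, as in the loading code above, that missing entries are written as `?`; the two-row CSV and the reduced column list are made up for illustration.

```python
import io
import pandas as pd

# Two made-up rows in a headerless CSV; the second has a missing value marked '?'
toy_csv = io.StringIO("63.0,1.0,0\n67.0,?,2\n")

# na_values turns '?' into NaN, so the column is parsed as float directly
toy = pd.read_csv(toy_csv, header=None, names=["age", "ca", "severity"], na_values="?")
toy = toy.dropna()  # drop rows that contain any missing value

# Collapse all nonzero severity levels into a single positive class
toy["severity"] = (toy["severity"] > 0).astype(int)
print(toy)
```

This avoids the separate `astype(float)` conversions, since columns containing `?` are never read as strings.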

## Dataset 3: Wine Quality
Predicting whether a wine is of high quality (quality score above 6) based on various features.


### Load data

Since the qualities of a good red wine might be different from those of a good white wine, I will split the dataset into two and model them separately, as well as model the combined dataset.

In [None]:
red_wine_dataset = pd.read_csv('wine_dataset/winequality-red.csv', sep=';') # read the dataset
white_wine_dataset = pd.read_csv('wine_dataset/winequality-white.csv', sep=';') # read the dataset
red_white_wine_dataset = pd.concat([red_wine_dataset, white_wine_dataset], axis=0) # concatenate the datasets
wine_datasets = [red_wine_dataset, white_wine_dataset, red_white_wine_dataset] 

for dataset in wine_datasets:
    dataset['quality'] = dataset['quality'].apply(lambda x: 0 if x <= 6 else 1) # convert to binary for 2 class classification

In [None]:
print(red_wine_dataset.dtypes)
red_wine_dataset.head()

In [None]:
print(white_wine_dataset.dtypes)
white_wine_dataset.head()

In [None]:
print(red_white_wine_dataset.dtypes)
red_white_wine_dataset.head()

This dataset is already clean and ready for modeling. 
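The binarization above labels a wine as 1 when its quality score is above 6. Since accuracy can look high when one class dominates, a quick look at the resulting class balance can help when reading the results. The sketch below uses made-up scores, not the actual wine data.

```python
import pandas as pd

quality = pd.Series([5, 6, 7, 5, 8, 6, 4, 6])  # made-up quality scores

# Same rule as `0 if x <= 6 else 1`, written as a vectorized comparison
labels = (quality > 6).astype(int)

# Fraction of samples in each class, indexed by class label
balance = labels.value_counts(normalize=True).sort_index()
print(balance)
```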

# Methods and Models

I am going to use the following models for each dataset: Support Vector Machine, Random Forest, and Logistic Regression. For each of the models I will:

- Preprocess the data
    - Separate the data into features and target column
    - Scale the features using StandardScaler to standardize the input values for better model performance

- Partition the data into training and testing sets
    - Use 20/80, 50/50, and 80/20 test/train splits (the first number is the percentage held out for testing) to evaluate model performance under different ratios

- Perform hyperparameter tuning using GridSearchCV
    - Tune the hyperparameters for each model to optimize performance

- Train the model
    - Fit the model on the training data for each partition using the best hyperparameters

- Evaluate the model
    - Evaluate the model using cross-validation scores to compute the average validation accuracy
    - Evaluate the model on the test set to compute the test accuracy
    - Generate classification reports to analyze precision, recall, and F1 scores
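The steps above can be sketched end to end on synthetic data. This is a minimal illustration under stated assumptions, not the code used for the experiments: the data comes from `make_classification`, and the scaler is wrapped in a scikit-learn `Pipeline` (not used elsewhere in this notebook) so that each cross-validation fold fits its own scaler.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic binary classification data standing in for a real dataset
X, y = make_classification(n_samples=200, n_features=8, random_state=42)

# Hold out 20% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scaling lives inside the pipeline, so it is refit on the training part of each CV fold
pipeline = Pipeline([("scaler", StandardScaler()), ("svc", SVC())])
param_grid = {"svc__C": [0.1, 1, 10], "svc__kernel": ["rbf", "linear"]}

# Tune hyperparameters with 5-fold cross-validation on the training set
grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
grid_search.fit(X_train, y_train)

validation_accuracy = grid_search.best_score_      # mean CV accuracy of the best setting
test_accuracy = grid_search.score(X_test, y_test)  # accuracy of the refit best model
print(grid_search.best_params_)
print(f"Validation accuracy: {validation_accuracy:.2f}, test accuracy: {test_accuracy:.2f}")
```

Because `GridSearchCV` refits the best setting on the full training set by default, the tuned model can be scored on the test set without a separate `fit` call.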



## Support Vector Machine 

In [None]:
def train_svm_with_partitions(df, target_column):

    # Prepare features and target
    X = df.drop(target_column, axis=1)
    y = df[target_column]
    
    # Define partitions: labels are test/train percentages, values are the test fraction
    partitions = {
        "20/80": 0.2,
        "50/50": 0.5,
        "80/20": 0.8
    }

    results = []  # Store results for analysis
    for partition_name, test_size in tqdm(partitions.items(), desc="Partitions"):
        print(f"\nTraining with {partition_name} partition...")

        # Split data into training and testing
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=42)

        # Scale features, fitting the scaler on the training split only to avoid test-set leakage
        scaler = StandardScaler()
        X_train = scaler.fit_transform(X_train)
        X_test = scaler.transform(X_test)

        # Define hyperparameter grid
        param_grid = {
            'C': [0.1, 1, 10], 
            'gamma': ['scale', 0.1],  
            'kernel': ['rbf', 'linear']
        }

        # Perform grid search with cross-validation
        grid_search = GridSearchCV(SVC(), param_grid, cv=5, scoring='accuracy', n_jobs=-1)
        grid_search.fit(X_train, y_train)
        
        # Get the best model
        best_model = grid_search.best_estimator_
        print(f"Best parameters: {grid_search.best_params_}")
        
        # Evaluate cross-validation accuracy on training set
        cv_scores = cross_val_score(best_model, X_train, y_train, cv=5, scoring='accuracy')
        avg_cv_score = np.mean(cv_scores)
        print(f"Average cross-validation accuracy: {avg_cv_score:.2f}")
        
        # Evaluate on test set
        best_model.fit(X_train, y_train)
        y_pred = best_model.predict(X_test)
        test_accuracy = accuracy_score(y_test, y_pred)
        print(f"Test accuracy: {test_accuracy:.2f}")
        print("Classification Report:")
        print(classification_report(y_test, y_pred))
        
        # Store results
        results.append({
            "Partition": partition_name,
            "CV Accuracy": avg_cv_score,
            "Test Accuracy": test_accuracy
        })
        
        # Visualize confusion matrix
        ConfusionMatrixDisplay.from_estimator(best_model, X_test, y_test)
        plt.title(f"Confusion Matrix: {partition_name} Partition")
        plt.show()

    # Display summarized results
    results_df = pd.DataFrame(results)
    print("\nSummary of Results:")
    print(results_df)
    
    # Plot validation vs. test accuracies
    plt.figure(figsize=(8, 6))
    plt.plot(results_df["Partition"], results_df["CV Accuracy"], label="Validation Accuracy", marker='o')
    plt.plot(results_df["Partition"], results_df["Test Accuracy"], label="Test Accuracy", marker='o')
    plt.xlabel("Partition")
    plt.ylabel("Accuracy")
    plt.title("SVM Performance Across Partitions")
    plt.legend()
    plt.grid()
    plt.show()

## Random Forest

In [None]:
def train_random_forest_with_partitions(df, target_column):

    # Prepare features and target
    X = df.drop(target_column, axis=1)
    y = df[target_column]
    
    # Define partitions: labels are test/train percentages, values are the test fraction
    partitions = {
        "20/80": 0.2,
        "50/50": 0.5,
        "80/20": 0.8
    }

    results = []  # Store results for analysis
    for partition_name, test_size in tqdm(partitions.items(), desc="Partitions"):
        print(f"\nTraining with {partition_name} partition...")

        # Split data into training and testing
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=42)

        # Scale features, fitting the scaler on the training split only to avoid test-set leakage
        scaler = StandardScaler()
        X_train = scaler.fit_transform(X_train)
        X_test = scaler.transform(X_test)

        # Define hyperparameter grid
        param_grid = {
            'n_estimators': [50, 100, 200],  
            'max_depth': [None, 10, 20], 
            'min_samples_split': [2, 5, 10],
            'bootstrap': [True, False]
        }

        # Perform grid search with cross-validation
        grid_search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5, scoring='accuracy', n_jobs=-1)
        grid_search.fit(X_train, y_train)
        
        # Get the best model
        best_model = grid_search.best_estimator_
        print(f"Best parameters: {grid_search.best_params_}")
        
        # Evaluate cross-validation accuracy on training set
        cv_scores = cross_val_score(best_model, X_train, y_train, cv=5, scoring='accuracy')
        avg_cv_score = np.mean(cv_scores)
        print(f"Average cross-validation accuracy: {avg_cv_score:.2f}")
        
        # Evaluate on test set
        best_model.fit(X_train, y_train)
        y_pred = best_model.predict(X_test)
        test_accuracy = accuracy_score(y_test, y_pred)
        print(f"Test accuracy: {test_accuracy:.2f}")
        print("Classification Report:")
        print(classification_report(y_test, y_pred))
        
        # Store results
        results.append({
            "Partition": partition_name,
            "CV Accuracy": avg_cv_score,
            "Test Accuracy": test_accuracy
        })
        
        # Visualize confusion matrix
        ConfusionMatrixDisplay.from_estimator(best_model, X_test, y_test)
        plt.title(f"Confusion Matrix: {partition_name} Partition")
        plt.show()

    # Display summarized results
    results_df = pd.DataFrame(results)
    print("\nSummary of Results:")
    print(results_df)
    
    # Plot validation vs. test accuracies
    plt.figure(figsize=(8, 6))
    plt.plot(results_df["Partition"], results_df["CV Accuracy"], label="Validation Accuracy", marker='o')
    plt.plot(results_df["Partition"], results_df["Test Accuracy"], label="Test Accuracy", marker='o')
    plt.xlabel("Partition")
    plt.ylabel("Accuracy")
    plt.title("Random Forest Performance Across Partitions")
    plt.legend()
    plt.grid()
    plt.show()

## Logistic Regression

In [None]:
def train_logistic_regression_with_partitions(df, target_column):

    # Prepare features and target
    X = df.drop(target_column, axis=1)
    y = df[target_column]
    
    # Define partitions: labels are test/train percentages, values are the test fraction
    partitions = {
        "20/80": 0.2,
        "50/50": 0.5,
        "80/20": 0.8
    }

    results = []  # Store results for analysis
    for partition_name, test_size in tqdm(partitions.items(), desc="Partitions"):
        print(f"\nTraining with {partition_name} partition...")

        # Split data into training and testing
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=42)

        # Scale features, fitting the scaler on the training split only to avoid test-set leakage
        scaler = StandardScaler()
        X_train = scaler.fit_transform(X_train)
        X_test = scaler.transform(X_test)

        # Define hyperparameter grid
        param_grid = {
            'C': [0.1, 1, 10, 100],  
            'penalty': ['l1', 'l2'],  
            'solver': ['liblinear', 'saga'] ,
            'fit_intercept': [True, False]
        }

        # Perform grid search with cross-validation
        grid_search = GridSearchCV(LogisticRegression(max_iter=1000, random_state=42), param_grid, cv=5, scoring='accuracy', n_jobs=-1)
        grid_search.fit(X_train, y_train)
        
        # Get the best model
        best_model = grid_search.best_estimator_
        print(f"Best parameters: {grid_search.best_params_}")
        
        # Evaluate cross-validation accuracy on training set
        cv_scores = cross_val_score(best_model, X_train, y_train, cv=5, scoring='accuracy')
        avg_cv_score = np.mean(cv_scores)
        print(f"Average cross-validation accuracy: {avg_cv_score:.2f}")
        
        # Evaluate on test set
        best_model.fit(X_train, y_train)
        y_pred = best_model.predict(X_test)
        test_accuracy = accuracy_score(y_test, y_pred)
        print(f"Test accuracy: {test_accuracy:.2f}")
        print("Classification Report:")
        print(classification_report(y_test, y_pred))
        
        # Store results
        results.append({
            "Partition": partition_name,
            "CV Accuracy": avg_cv_score,
            "Test Accuracy": test_accuracy
        })
        
        # Visualize confusion matrix
        ConfusionMatrixDisplay.from_estimator(best_model, X_test, y_test)
        plt.title(f"Confusion Matrix: {partition_name} Partition")
        plt.show()

    # Display summarized results
    results_df = pd.DataFrame(results)
    print("\nSummary of Results:")
    print(results_df)
    
    # Plot validation vs. test accuracies
    plt.figure(figsize=(8, 6))
    plt.plot(results_df["Partition"], results_df["CV Accuracy"], label="Validation Accuracy", marker='o')
    plt.plot(results_df["Partition"], results_df["Test Accuracy"], label="Test Accuracy", marker='o')
    plt.xlabel("Partition")
    plt.ylabel("Accuracy")
    plt.title("Logistic Regression Performance Across Partitions")
    plt.legend()
    plt.grid()
    plt.show()
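
The three functions above differ only in the estimator and its hyperparameter grid. As a sketch of one possible refactor (an assumption, not the code used in this project), a single helper can take the estimator and grid as arguments and return its results, so that all models end up in one comparison table. The helper name `evaluate_with_partitions`, the reduced grids, and the synthetic data are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def evaluate_with_partitions(estimator, param_grid, X, y, test_sizes=(0.2, 0.5, 0.8)):
    """Hypothetical helper: tune and score one estimator for each test fraction."""
    rows = []
    for test_size in test_sizes:
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=test_size, random_state=42
        )
        # Fit the scaler on the training split only
        scaler = StandardScaler().fit(X_train)
        search = GridSearchCV(estimator, param_grid, cv=5, scoring="accuracy")
        search.fit(scaler.transform(X_train), y_train)
        rows.append({
            "Test fraction": test_size,
            "CV Accuracy": search.best_score_,
            "Test Accuracy": search.score(scaler.transform(X_test), y_test),
        })
    return pd.DataFrame(rows)


# Small grids and synthetic data keep the sketch quick to run
models = {
    "SVM": (SVC(), {"C": [0.1, 1, 10]}),
    "Random Forest": (RandomForestClassifier(random_state=42), {"n_estimators": [50, 100]}),
    "Logistic Regression": (LogisticRegression(max_iter=1000), {"C": [0.1, 1, 10]}),
}
X, y = make_classification(n_samples=200, n_features=8, random_state=42)

# Stack the per-model tables, labelling the outer index level with the model name
comparison = pd.concat(
    {name: evaluate_with_partitions(est, grid, X, y) for name, (est, grid) in models.items()},
    names=["Model"],
)
print(comparison)
```

Returning a DataFrame instead of only printing makes it possible to compare models side by side after all runs have finished.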

# Experiments

### Running SVM on all the datasets

In [None]:
datasets = [
    {"name": "Adult Dataset", "data": adult_dataset, "target": "salary"},
    {"name": "Heart Dataset", "data": heart_dataset, "target": "severity"},
    {"name": "Red Wine Dataset", "data": red_wine_dataset, "target": "quality"},
    {"name": "White Wine Dataset", "data": white_wine_dataset, "target": "quality"},
    {"name": "Red and White Wine Dataset", "data": red_white_wine_dataset, "target": "quality"}
]

for dataset_info in tqdm(datasets, desc="Datasets"):
    dataset_name = dataset_info["name"]
    dataset = dataset_info["data"]
    target_column = dataset_info["target"]
    print(f"\nTraining on '{dataset_name}' with target column '{target_column}'...")
    train_svm_with_partitions(dataset, target_column)

### SVM Analysis
type here

### Running random forest on all the datasets

In [None]:
datasets = [
    {"name": "Adult Dataset", "data": adult_dataset, "target": "salary"},
    {"name": "Heart Dataset", "data": heart_dataset, "target": "severity"},
    {"name": "Red Wine Dataset", "data": red_wine_dataset, "target": "quality"},
    {"name": "White Wine Dataset", "data": white_wine_dataset, "target": "quality"},
    {"name": "Red and White Wine Dataset", "data": red_white_wine_dataset, "target": "quality"}
]

for dataset_info in tqdm(datasets, desc="Datasets"):
    dataset_name = dataset_info["name"]
    dataset = dataset_info["data"]
    target_column = dataset_info["target"]
    print(f"\nTraining on '{dataset_name}' with target column '{target_column}'...")
    train_random_forest_with_partitions(dataset, target_column)

### Random Forest Analysis
type here

### Running logistic regression on all the datasets

In [None]:
datasets = [
    {"name": "Adult Dataset", "data": adult_dataset, "target": "salary"},
    {"name": "Heart Dataset", "data": heart_dataset, "target": "severity"},
    {"name": "Red Wine Dataset", "data": red_wine_dataset, "target": "quality"},
    {"name": "White Wine Dataset", "data": white_wine_dataset, "target": "quality"},
    {"name": "Red and White Wine Dataset", "data": red_white_wine_dataset, "target": "quality"}
]

for dataset_info in tqdm(datasets, desc="Datasets"):
    dataset_name = dataset_info["name"]
    dataset = dataset_info["data"]
    target_column = dataset_info["target"]
    print(f"\nTraining on '{dataset_name}' with target column '{target_column}'...")
    train_logistic_regression_with_partitions(dataset, target_column)

### Logistic Regression Analysis
type here