---
title: "Comprehensive Study on the Impact of Feature Scaling on Classification Models"
author: "Sherry Thomas"
format:
  html:
    theme: theme.scss
    toc: true
    html-math-method: katex
---

## Introduction

Feature scaling is a preprocessing step that can significantly influence the performance of classification models. It transforms the features to a common scale so that no single feature dominates the learning process merely because of its range of values. This notebook explores how several feature scaling methods affect classification models. We will focus on five commonly used techniques:

1. Standard Scaler
2. Min-max Scaler
3. Maximum Absolute Scaler
4. Robust Scaler
5. Quantile Transformer

We will use four different datasets provided by scikit-learn, which are frequently employed for classification tasks:

1. Iris dataset
2. Digits dataset
3. Wine dataset
4. Breast Cancer dataset

## Importing Necessary Libraries

Before we begin, we need to import the required libraries for data manipulation, visualization, and machine learning.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris, load_digits, load_wine, load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler, MaxAbsScaler, RobustScaler, QuantileTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

## Loading the Datasets

We start by loading the four datasets and inspecting their structures.

In [4]:
# Load the datasets
iris = load_iris()
digits = load_digits()
wine = load_wine()
breast_cancer = load_breast_cancer()

# Create DataFrames for the datasets
iris_df = pd.DataFrame(data=np.c_[iris['data'], iris['target']], columns=iris['feature_names'] + ['target'])
digits_df = pd.DataFrame(data=np.c_[digits['data'], digits['target']], columns=digits['feature_names'] + ['target'])
wine_df = pd.DataFrame(data=np.c_[wine['data'], wine['target']], columns=wine['feature_names'] + ['target'])
breast_cancer_df = pd.DataFrame(data=np.c_[breast_cancer['data'], breast_cancer['target']], columns=list(breast_cancer['feature_names']) + ['target'])

# Display the first few rows of iris dataset
iris_df.head()

|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0.0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0.0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0.0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0.0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0.0 |


In [6]:
# Display the first few rows of wine dataset
wine_df.head()

|   | alcohol | malic_acid | ash | alcalinity_of_ash | magnesium | total_phenols | flavanoids | nonflavanoid_phenols | proanthocyanins | color_intensity | hue | od280/od315_of_diluted_wines | proline | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 14.23 | 1.71 | 2.43 | 15.6 | 127.0 | 2.8 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065.0 | 0.0 |
| 1 | 13.2 | 1.78 | 2.14 | 11.2 | 100.0 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.4 | 1050.0 | 0.0 |
| 2 | 13.16 | 2.36 | 2.67 | 18.6 | 101.0 | 2.8 | 3.24 | 0.3 | 2.81 | 5.68 | 1.03 | 3.17 | 1185.0 | 0.0 |
| 3 | 14.37 | 1.95 | 2.5 | 16.8 | 113.0 | 3.85 | 3.49 | 0.24 | 2.18 | 7.8 | 0.86 | 3.45 | 1480.0 | 0.0 |
| 4 | 13.24 | 2.59 | 2.87 | 21.0 | 118.0 | 2.8 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735.0 | 0.0 |


In [7]:
# Display the first few rows of breast cancer dataset
breast_cancer_df.head()

|   | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | 0.07871 | ... | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 | 0.0 |
| 1 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 | 0.0 |
| 2 | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | 0.05999 | ... | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 | 0.0 |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | 0.2597 | 0.09744 | ... | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 | 0.0 |
| 4 | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | 0.1809 | 0.05883 | ... | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 | 0.0 |


The datasets contain various features related to their respective domains, with a 'target' column indicating the class labels.

## Data Preprocessing

Before we proceed with feature scaling, we need to split the data for each dataset into training and testing sets. To make the comparison more demanding, we first add Gaussian noise (mean 0, standard deviation 0.2, in each feature's raw units) to the feature values and then split the noisy data, so both the training and the test sets are noisy. Because the noise level is the same for every feature, it perturbs small-valued features far more than large-valued ones. These noisy datasets introduce variations that can better showcase the effects of different scaling methods on classification model performance.

In [8]:
# Define a function to create noisy datasets
def create_noisy_dataset(dataset, noise_std=0.2, test_size=0.2, random_state=42):
    X = dataset.data
    y = dataset.target

    np.random.seed(random_state)
    noise = np.random.normal(0, noise_std, size=X.shape)
    X_noisy = X + noise

    X_train_noisy, X_test_noisy, y_train, y_test = train_test_split(X_noisy, y, test_size=test_size, random_state=random_state)

    return X_train_noisy, X_test_noisy, y_train, y_test

# Create noisy datasets for all four datasets
X_train_iris_noisy, X_test_iris_noisy, y_train_iris, y_test_iris = create_noisy_dataset(iris)
X_train_digits_noisy, X_test_digits_noisy, y_train_digits, y_test_digits = create_noisy_dataset(digits)
X_train_wine_noisy, X_test_wine_noisy, y_train_wine, y_test_wine = create_noisy_dataset(wine)
X_train_breast_cancer_noisy, X_test_breast_cancer_noisy, y_train_breast_cancer, y_test_breast_cancer = create_noisy_dataset(breast_cancer)
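
One consequence of using a single `noise_std` for every feature is worth checking before interpreting the results. The sketch below is not part of the original experiment; it counts how many Breast Cancer features have a standard deviation smaller than the noise itself. The names `bc`, `feature_std`, and `swamped` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

bc = load_breast_cancer()
noise_std = 0.2  # same default as create_noisy_dataset above

# Per-feature standard deviation of the clean data
feature_std = bc.data.std(axis=0)

# Features whose natural spread is smaller than the added noise
swamped = feature_std < noise_std

print(f"{swamped.sum()} of {bc.data.shape[1]} features have std below {noise_std}")
for name in bc.feature_names[swamped][:5]:
    print(" -", name)
```

For features flagged here, the added noise is larger than the original signal, which may partly explain why accuracy on this dataset moves little across scaling methods.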

## Feature Scaling Methods

### 1. Standard Scaler

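The Standard Scaler ($SS$) transforms the data so that each feature has a mean ($\mu$) of 0 and a standard deviation ($\sigma$) of 1. It does not require the data to be normally distributed, although the scaled values resemble a standard normal distribution only when the original feature is roughly Gaussian, and both $\mu$ and $\sigma$ are sensitive to outliers. The transformation is given by: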

$$
SS(x) = \frac{x - \mu}{\sigma}
$$

where $x$ is the original feature vector, $\mu$ is the mean of the feature vector, and $\sigma$ is the standard deviation of the feature vector.
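
As a quick check of the formula, the following sketch standardizes a small made-up matrix by hand and compares the result with `StandardScaler`. The array `X_demo` and its values are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix: two features on very different scales (illustrative values)
X_demo = np.array([[1.0, 100.0],
                   [2.0, 300.0],
                   [3.0, 500.0]])

# Manual standardization using the population standard deviation (ddof=0),
# which is the convention StandardScaler follows
manual_ss = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

sklearn_ss = StandardScaler().fit_transform(X_demo)

print(np.allclose(manual_ss, sklearn_ss))
print(sklearn_ss.mean(axis=0), sklearn_ss.std(axis=0))
```

After scaling, both columns have mean 0 and standard deviation 1 even though their original ranges differed by two orders of magnitude.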

In [9]:
# Define a function to apply Standard Scaler to a dataset
def apply_standard_scaler(X_train, X_test):
    standard_scaler = StandardScaler()
    X_train_scaled = standard_scaler.fit_transform(X_train)
    X_test_scaled = standard_scaler.transform(X_test)
    return X_train_scaled, X_test_scaled

# Apply Standard Scaler to all four datasets
X_train_iris_standard, X_test_iris_standard = apply_standard_scaler(X_train_iris_noisy, X_test_iris_noisy)
X_train_digits_standard, X_test_digits_standard = apply_standard_scaler(X_train_digits_noisy, X_test_digits_noisy)
X_train_wine_standard, X_test_wine_standard = apply_standard_scaler(X_train_wine_noisy, X_test_wine_noisy)
X_train_breast_cancer_standard, X_test_breast_cancer_standard = apply_standard_scaler(X_train_breast_cancer_noisy, X_test_breast_cancer_noisy)

### 2. Min-max Scaler

The Min-max Scaler ($MMS$) scales the data to a specific range, typically between 0 and 1. It is suitable for data that does not follow a normal distribution. The transformation is given by:

$$
MMS(x) = \frac{x - x_{min}}{x_{max} - x_{min}}
$$

where $x$ is the original feature vector, $x_{min}$ is the smallest value in the feature vector, and $x_{max}$ is the largest value in the feature vector.
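
A small sketch may help show what the formula does to unseen data. Here `X_fit` and `X_new` are made-up values; the point is that $x_{min}$ and $x_{max}$ come from the data the scaler was fitted on, so new values outside that range land outside $[0, 1]$.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_fit = np.array([[2.0], [4.0], [10.0]])   # fitted values: min 2, max 10
X_new = np.array([[1.0], [6.0], [12.0]])   # unseen values, two outside [2, 10]

mms = MinMaxScaler().fit(X_fit)

# Manual formula using the fitted minimum and maximum
x_min, x_max = X_fit.min(axis=0), X_fit.max(axis=0)
manual_fit = (X_fit - x_min) / (x_max - x_min)
manual_new = (X_new - x_min) / (x_max - x_min)

scaled_fit = mms.transform(X_fit)
scaled_new = mms.transform(X_new)

print(scaled_fit.ravel())  # values inside [0, 1]
print(scaled_new.ravel())  # first and last fall outside [0, 1]
```

This is the same fit-on-train, transform-on-test pattern used by `apply_min_max_scaler`, so scaled test values are not guaranteed to stay within the unit interval.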

In [11]:
# Define a function to apply Min-max Scaler to a dataset
def apply_min_max_scaler(X_train, X_test):
    min_max_scaler = MinMaxScaler()
    X_train_scaled = min_max_scaler.fit_transform(X_train)
    X_test_scaled = min_max_scaler.transform(X_test)
    return X_train_scaled, X_test_scaled

# Apply Min-max Scaler to all four datasets
X_train_iris_minmax, X_test_iris_minmax = apply_min_max_scaler(X_train_iris_noisy, X_test_iris_noisy)
X_train_digits_minmax, X_test_digits_minmax = apply_min_max_scaler(X_train_digits_noisy, X_test_digits_noisy)
X_train_wine_minmax, X_test_wine_minmax = apply_min_max_scaler(X_train_wine_noisy, X_test_wine_noisy)
X_train_breast_cancer_minmax, X_test_breast_cancer_minmax = apply_min_max_scaler(X_train_breast_cancer_noisy, X_test_breast_cancer_noisy)

### 3. Maximum Absolute Scaler

The Maximum Absolute Scaler ($MAS$) scales the data based on the maximum absolute value, making the largest absolute value in each feature equal to 1. It does not shift/center the data, and thus does not destroy any sparsity. The transformation is given by:

$$
MAS(x) = \frac{x}{\max(|x|)}
$$

where $x$ is the original feature vector, and $\max(|x|)$ is the maximum absolute value in the feature vector.
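
The sparsity claim can be illustrated with a short sketch on a made-up matrix containing zeros and negative values (`X_sparse_like` is an illustrative name). Dividing by the largest absolute value keeps zeros at zero and preserves signs.

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

# Illustrative matrix with zeros and negative entries
X_sparse_like = np.array([[ 0.0, -4.0],
                          [ 2.0,  0.0],
                          [-8.0,  1.0]])

# Manual formula: divide each column by its largest absolute value
manual_mas = X_sparse_like / np.abs(X_sparse_like).max(axis=0)

scaled_mas = MaxAbsScaler().fit_transform(X_sparse_like)

print(scaled_mas)
# Zeros stay exactly where they were
print(np.array_equal(scaled_mas == 0, X_sparse_like == 0))
```

Because no offset is subtracted, the result lies in $[-1, 1]$ rather than $[0, 1]$ whenever a feature takes negative values.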

In [12]:
# Define a function to apply Maximum Absolute Scaler to a dataset
def apply_max_abs_scaler(X_train, X_test):
    max_abs_scaler = MaxAbsScaler()
    X_train_scaled = max_abs_scaler.fit_transform(X_train)
    X_test_scaled = max_abs_scaler.transform(X_test)
    return X_train_scaled, X_test_scaled

# Apply Maximum Absolute Scaler to all four datasets
X_train_iris_maxabs, X_test_iris_maxabs = apply_max_abs_scaler(X_train_iris_noisy, X_test_iris_noisy)
X_train_digits_maxabs, X_test_digits_maxabs = apply_max_abs_scaler(X_train_digits_noisy, X_test_digits_noisy)
X_train_wine_maxabs, X_test_wine_maxabs = apply_max_abs_scaler(X_train_wine_noisy, X_test_wine_noisy)
X_train_breast_cancer_maxabs, X_test_breast_cancer_maxabs = apply_max_abs_scaler(X_train_breast_cancer_noisy, X_test_breast_cancer_noisy)

### 4. Robust Scaler

The Robust Scaler ($RS$) scales the data using the median ($Q_2$) and the interquartile range ($IQR$, $Q_3 - Q_1$), making it robust to outliers. The transformation is given by:

$$
RS(x) = \frac{x - Q_2}{IQR}
$$

where $x$ is the original feature vector, $Q_2$ is the median of the feature vector, and $IQR$ is the interquartile range of the feature vector.
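
To see what "robust to outliers" means in practice, the sketch below scales a made-up feature containing one extreme value with both `RobustScaler` and `StandardScaler`. The array `x_out` is illustrative; the manual computation assumes the default 25th to 75th percentile range.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Five ordinary values and one extreme outlier (illustrative data)
x_out = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [100.0]])

# Manual formula: subtract the median, divide by the interquartile range
q1, q2, q3 = np.percentile(x_out, [25, 50, 75], axis=0)
manual_rs = (x_out - q2) / (q3 - q1)

robust = RobustScaler().fit_transform(x_out)
standard = StandardScaler().fit_transform(x_out)

# How far apart the five ordinary values end up under each scaler
inlier_spread_robust = robust[:5].max() - robust[:5].min()
inlier_spread_standard = standard[:5].max() - standard[:5].min()

print(np.allclose(manual_rs, robust))
print(inlier_spread_robust, inlier_spread_standard)
```

The outlier inflates the mean and standard deviation, so `StandardScaler` squeezes the ordinary values into a narrow band, while the median and IQR are barely affected and keep them well separated.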

In [13]:
# Define a function to apply Robust Scaler to a dataset
def apply_robust_scaler(X_train, X_test):
    robust_scaler = RobustScaler()
    X_train_scaled = robust_scaler.fit_transform(X_train)
    X_test_scaled = robust_scaler.transform(X_test)
    return X_train_scaled, X_test_scaled

# Apply Robust Scaler to all four datasets
X_train_iris_robust, X_test_iris_robust = apply_robust_scaler(X_train_iris_noisy, X_test_iris_noisy)
X_train_digits_robust, X_test_digits_robust = apply_robust_scaler(X_train_digits_noisy, X_test_digits_noisy)
X_train_wine_robust, X_test_wine_robust = apply_robust_scaler(X_train_wine_noisy, X_test_wine_noisy)
X_train_breast_cancer_robust, X_test_breast_cancer_robust = apply_robust_scaler(X_train_breast_cancer_noisy, X_test_breast_cancer_noisy)

### 5. Quantile Transformer

The Quantile Transformer ($QT$) applies a non-linear transformation to the data, mapping it to a uniform or normal distribution. This method can be helpful when the data is not normally distributed. It estimates the cumulative distribution function (CDF) of each feature to map every value to its quantile in $[0, 1]$, and then applies the inverse CDF of the target distribution. The transformation is given by:

$$
QT(x) = G^{-1}(F(x))
$$

where $F$ is the (empirical) cumulative distribution function of the feature, and $G^{-1}$ is the inverse CDF (quantile function) of the target distribution. For a uniform target, $G^{-1}$ is the identity on $[0, 1]$.
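
The quantile mapping can be sketched on synthetic data. The example below assumes a uniform target distribution, for which the inverse CDF of the target is the identity, so the output is simply the empirical CDF value of each sample. The skewed sample `x_skewed` and the choice `n_quantiles=200` (equal to the sample size) are illustrative assumptions, not settings from the experiment above.

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

# Strongly right-skewed synthetic feature (illustrative data)
rng = np.random.default_rng(0)
x_skewed = rng.lognormal(mean=0.0, sigma=1.0, size=(200, 1))

# One quantile landmark per sample, uniform output on [0, 1]
qt = QuantileTransformer(n_quantiles=200, output_distribution="uniform")
u = qt.fit_transform(x_skewed)

# Empirical CDF computed by hand: rank of each value, rescaled to [0, 1]
ranks = np.argsort(np.argsort(x_skewed[:, 0]))
empirical_cdf = ranks / (len(ranks) - 1)

print(u.min(), u.max())
print(np.allclose(u[:, 0], empirical_cdf, atol=1e-6))
```

The transform depends only on the ordering of the values, which is why it removes skew and limits the influence of outliers. With `output_distribution='normal'`, as used in `apply_quantile_transformer`, these uniform values are additionally passed through the inverse normal CDF.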

In [None]:
# Define a function to apply Quantile Transformer to a dataset
def apply_quantile_transformer(X_train, X_test):
    quantile_transformer = QuantileTransformer(output_distribution='normal')
    X_train_scaled = quantile_transformer.fit_transform(X_train)
    X_test_scaled = quantile_transformer.transform(X_test)
    return X_train_scaled, X_test_scaled

# Apply Quantile Transformer to all four datasets
X_train_iris_quantile, X_test_iris_quantile = apply_quantile_transformer(X_train_iris_noisy, X_test_iris_noisy)
X_train_digits_quantile, X_test_digits_quantile = apply_quantile_transformer(X_train_digits_noisy, X_test_digits_noisy)
X_train_wine_quantile, X_test_wine_quantile = apply_quantile_transformer(X_train_wine_noisy, X_test_wine_noisy)
X_train_breast_cancer_quantile, X_test_breast_cancer_quantile = apply_quantile_transformer(X_train_breast_cancer_noisy, X_test_breast_cancer_noisy)

## Classification Models

We will now compare the performance of two classification models, Random Forest and Support Vector Machine (SVM), on the different scaled datasets. For each scaling method, we will train and evaluate both models for all four datasets.

In [19]:
# Initialize the Random Forest classifier
rf_classifier = RandomForestClassifier(random_state=42)

# Initialize the SVM classifier
svm_classifier = SVC(random_state=42)

# Lists to store accuracy scores
accuracy_scores = []

# Loop through each dataset and scaling method, and evaluate the models
datasets = [
    ("Iris", X_train_iris_noisy, X_test_iris_noisy, y_train_iris, y_test_iris),
    ("Digits", X_train_digits_noisy, X_test_digits_noisy, y_train_digits, y_test_digits),
    ("Wine", X_train_wine_noisy, X_test_wine_noisy, y_train_wine, y_test_wine),
    ("Breast Cancer", X_train_breast_cancer_noisy, X_test_breast_cancer_noisy, y_train_breast_cancer, y_test_breast_cancer)
]

scaling_methods = {
    "No Scaling": {
        "Iris": [X_train_iris_noisy, X_test_iris_noisy],
        "Digits": [X_train_digits_noisy, X_test_digits_noisy],
        "Wine": [X_train_wine_noisy, X_test_wine_noisy],
        "Breast Cancer": [X_train_breast_cancer_noisy, X_test_breast_cancer_noisy]
    },
    "Standard Scaler": {
        "Iris": [X_train_iris_standard, X_test_iris_standard],
        "Digits": [X_train_digits_standard, X_test_digits_standard],
        "Wine": [X_train_wine_standard, X_test_wine_standard],
        "Breast Cancer": [X_train_breast_cancer_standard, X_test_breast_cancer_standard]
    },
    "Min-max Scaler": {
        "Iris": [X_train_iris_minmax, X_test_iris_minmax],
        "Digits": [X_train_digits_minmax, X_test_digits_minmax],
        "Wine": [X_train_wine_minmax, X_test_wine_minmax],
        "Breast Cancer": [X_train_breast_cancer_minmax, X_test_breast_cancer_minmax]
    },
    "Maximum Absolute Scaler": {
        "Iris": [X_train_iris_maxabs, X_test_iris_maxabs],
        "Digits": [X_train_digits_maxabs, X_test_digits_maxabs],
        "Wine": [X_train_wine_maxabs, X_test_wine_maxabs],
        "Breast Cancer": [X_train_breast_cancer_maxabs, X_test_breast_cancer_maxabs]
    },
    "Robust Scaler": {
        "Iris": [X_train_iris_robust, X_test_iris_robust],
        "Digits": [X_train_digits_robust, X_test_digits_robust],
        "Wine": [X_train_wine_robust, X_test_wine_robust],
        "Breast Cancer": [X_train_breast_cancer_robust, X_test_breast_cancer_robust]
    },
    "Quantile Transformer": {
        "Iris": [X_train_iris_quantile, X_test_iris_quantile],
        "Digits": [X_train_digits_quantile, X_test_digits_quantile],
        "Wine": [X_train_wine_quantile, X_test_wine_quantile],
        "Breast Cancer": [X_train_breast_cancer_quantile, X_test_breast_cancer_quantile]
    }
}

# Loop through datasets and scaling methods
for dataset_name, X_train, X_test, y_train, y_test in datasets:
    for scaler_name, scaled_data in scaling_methods.items():
        X_train_scaled, X_test_scaled = scaled_data[dataset_name]

        # Train the Random Forest model
        rf_classifier.fit(X_train_scaled, y_train)
        rf_predictions = rf_classifier.predict(X_test_scaled)
        
        # Train the SVM model
        svm_classifier.fit(X_train_scaled, y_train)
        svm_predictions = svm_classifier.predict(X_test_scaled)
        
        # Calculate accuracy scores for both models
        rf_accuracy = accuracy_score(y_test, rf_predictions)
        svm_accuracy = accuracy_score(y_test, svm_predictions)
        
        # Store the accuracy scores for comparison
        accuracy_scores.append([dataset_name, scaler_name, rf_accuracy, svm_accuracy])
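
The same comparison can be written more compactly with a scikit-learn `Pipeline`, which fits the scaler on the training data only and applies it automatically at prediction time. The sketch below is an alternative formulation, not the code used to produce the results in this notebook: it uses the clean Wine data without added noise, and the names `X_w`, `raw_svm`, and `scaled_svm` are illustrative.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Clean Wine data (no noise added here, unlike the experiment above)
X_w, y_w = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_w, y_w, test_size=0.2, random_state=42)

# SVM on raw features versus SVM preceded by standardization
raw_svm = SVC(random_state=42).fit(X_tr, y_tr)
scaled_svm = make_pipeline(StandardScaler(), SVC(random_state=42)).fit(X_tr, y_tr)

acc_raw = raw_svm.score(X_te, y_te)
acc_scaled = scaled_svm.score(X_te, y_te)

print(f"SVM without scaling: {acc_raw:.4f}")
print(f"SVM with StandardScaler: {acc_scaled:.4f}")
```

Bundling the scaler with the model also makes it straightforward to use cross-validation later without leaking test-set statistics into the scaler.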

## Results and Discussion

Let's analyze the results of our experiment and discuss the impact of different scaling methods on classification models for each dataset.

In [20]:
# Create a DataFrame to display the results
results_df = pd.DataFrame(accuracy_scores, columns=['Dataset', 'Scaling Method', 'Random Forest Accuracy', 'SVM Accuracy'])
results_df

|   | Dataset | Scaling Method | Random Forest Accuracy | SVM Accuracy |
|---|---|---|---|---|
| 0 | Iris | No Scaling | 0.966667 | 0.966667 |
| 1 | Iris | Standard Scaler | 0.966667 | 1.0 |
| 2 | Iris | Min-max Scaler | 0.966667 | 1.0 |
| 3 | Iris | Maximum Absolute Scaler | 0.966667 | 1.0 |
| 4 | Iris | Robust Scaler | 0.966667 | 1.0 |
| 5 | Iris | Quantile Transformer | 1.0 | 1.0 |
| 6 | Digits | No Scaling | 0.963889 | 0.991667 |
| 7 | Digits | Standard Scaler | 0.963889 | 0.986111 |
| 8 | Digits | Min-max Scaler | 0.963889 | 0.994444 |
| 9 | Digits | Maximum Absolute Scaler | 0.963889 | 0.988889 |


## Evaluation of Results

The notebook reports accuracy scores for two classification models, Random Forest and Support Vector Machine (SVM), under each feature scaling method. The effect is clearest on the Wine dataset, summarized here:

- **No Scaling**: Without any scaling, the Random Forest model achieved perfect accuracy (1.0), while the SVM model's accuracy was significantly lower (approximately 0.8056). This gap reflects SVM's sensitivity to the range of feature values: its default RBF kernel is computed from distances between samples, and features with large ranges dominate those distances.

- **Standard Scaler**: The Standard Scaler yielded perfect accuracy (1.0) for both models. Putting the Wine features on a comparable scale was enough to close the gap for SVM; the result does not by itself tell us whether the features are normally distributed.

- **Min-max Scaler**, **Maximum Absolute Scaler**, **Robust Scaler**, and **Quantile Transformer**: These methods also resulted in perfect accuracy (1.0) for both models. On this dataset, scaling the data to a specific range (Min-max Scaler and Maximum Absolute Scaler), making the scaling robust to outliers (Robust Scaler), and applying a non-linear transformation to map data to a uniform or normal distribution (Quantile Transformer) all raised SVM accuracy from about 0.81 to 1.0. Random Forest's performance remained consistently high regardless of the scaling method, which is consistent with the insensitivity of tree-based models to the scale of features.
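
Random Forest's insensitivity to scale follows from how decision trees split: each split compares one feature against a threshold, and a monotonic rescaling of that feature only moves the threshold. The sketch below checks this on the clean Iris data (no noise added), using 50 trees as an arbitrary illustrative choice. Predictions are expected to agree on all or nearly all test samples, although tiny floating-point differences could in principle change a tie.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

X_i, y_i = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_i, y_i, test_size=0.2, random_state=42)

scaler = MinMaxScaler().fit(X_tr)

# Two forests with identical seeds: one on raw features, one on scaled features
rf_raw = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_tr, y_tr)
rf_scaled = RandomForestClassifier(n_estimators=50, random_state=42).fit(
    scaler.transform(X_tr), y_tr
)

pred_raw = rf_raw.predict(X_te)
pred_scaled = rf_scaled.predict(scaler.transform(X_te))

# Fraction of test samples on which the two forests agree
agreement = np.mean(pred_raw == pred_scaled)
print(f"Agreement between raw and scaled forests: {agreement:.2%}")
```

The Quantile Transformer is also monotonic per feature, so a similar argument applies to it; small differences in Random Forest accuracy under that transform are more likely to come from tie-breaking among near-equal splits than from the scaling itself.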

### Iris Dataset

- **No Scaling**: Random Forest and SVM both scored 0.9667, which corresponds to 29 of the 30 test samples. The Iris features share similar ranges (all are measured in centimetres), so there is little for scaling to correct.

- **Standard Scaler**, **Min-max Scaler**, **Maximum Absolute Scaler**, **Robust Scaler**: These scaling methods left Random Forest at 0.9667 and raised SVM to 1.0000, a gain of a single test sample.

- **Quantile Transformer**: The Quantile Transformer resulted in perfect accuracy (1.0000) for both Random Forest and SVM. With only 30 test samples, the one-sample improvement for Random Forest is weak evidence of a real advantage.

### Digits Dataset

- **No Scaling**: Without scaling, Random Forest achieved an accuracy score of 0.9639, while SVM attained a score of 0.9917. The features in the Digits dataset were naturally well-scaled, and SVM demonstrated superior performance.

- **Standard Scaler**: The accuracy scores remained at 0.9639 for Random Forest and decreased slightly to 0.9861 for SVM.

- **Min-max Scaler**: Random Forest's accuracy remained at 0.9639, while SVM improved to 0.9944, showcasing the effectiveness of min-max scaling for SVM.

- **Maximum Absolute Scaler**: Accuracy scores for Random Forest remained at 0.9639, and SVM achieved a score of 0.9889, making it a competitive choice for this dataset.

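- **Robust Scaler**: Robust scaling led to a decrease in SVM's accuracy to 0.9056, the lowest SVM score on this dataset. A plausible cause is that many pixel features are almost constant, so their interquartile range is tiny and dividing by it inflates them relative to the informative pixels; this was not verified in the notebook.

- **Quantile Transformer**: This scaling method resulted in an accuracy score of 0.9667 for Random Forest and 0.9750 for SVM, a marginal gain for Random Forest and a drop for SVM relative to its unscaled score of 0.9917.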

### Wine Dataset

- **No Scaling**: The absence of scaling had a significant impact, with Random Forest achieving perfect accuracy (1.0000) and SVM lagging behind at 0.8056. This disparity underscored the significance of feature scaling, especially for SVM, which is sensitive to feature values.

- **Standard Scaler**, **Min-max Scaler**, **Maximum Absolute Scaler**, **Robust Scaler**, **Quantile Transformer**: These scaling methods all resulted in perfect accuracy (1.0000) for both Random Forest and SVM. The Wine dataset demonstrated the importance of feature scaling for enhancing model performance.
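
The gap between the unscaled and scaled SVM on Wine can be traced to a single feature. The sketch below, which uses the clean Wine data and illustrative variable names, computes each feature's share of the total variance. Since the expected squared Euclidean distance between two randomly chosen samples is twice the sum of the per-feature variances, this share is also the feature's share of the distances an RBF kernel sees.

```python
import numpy as np
from sklearn.datasets import load_wine

wine_data = load_wine()
X_wine = wine_data.data

# Per-feature variance and each feature's share of the total
variances = X_wine.var(axis=0)
share = variances / variances.sum()

# Feature contributing most to distances between unscaled samples
top = int(np.argmax(share))
print(f"{wine_data.feature_names[top]}: {share[top]:.1%} of total variance")
```

Without scaling, the distance between two wines is therefore almost entirely a difference in `proline`, and the other twelve features contribute very little to the SVM's decision function.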

### Breast Cancer Dataset

- **No Scaling**: The accuracy scores for Random Forest and SVM were 0.9561 and 0.9474, respectively, without scaling. Scaling changed SVM accuracy by less than two percentage points in either direction and left Random Forest unchanged.

- **Standard Scaler**: Random Forest remained at 0.9561, while SVM decreased slightly to 0.9298.

- **Min-max Scaler**: Both Random Forest and SVM achieved scores of 0.9561, the best SVM result on this dataset.

- **Maximum Absolute Scaler**: Random Forest remained at 0.9561, while SVM decreased slightly to 0.9298.

- **Robust Scaler**: Robust scaling matched the unscaled baseline, with accuracy scores of 0.9561 for Random Forest and 0.9474 for SVM.

- **Quantile Transformer**: This scaling method resulted in accuracy scores of 0.9561 for Random Forest and 0.9298 for SVM, preserving Random Forest's performance but slightly lowering SVM's.

## Conclusion

In conclusion, the evaluation of results highlights the influence of different feature scaling methods on the performance of classification models for four diverse datasets. The key takeaways are as follows:

- For datasets whose features already share similar ranges, like Iris and Digits, most scaling methods changed accuracy only slightly. The exception was the Robust Scaler on Digits, which lowered SVM accuracy to 0.9056.

- For the Wine dataset, whose features span very different ranges, feature scaling was essential for SVM, raising its accuracy from 0.8056 to 1.0000. On Breast Cancer the effect was much smaller, with SVM accuracy staying between 0.9298 and 0.9561.

- The Quantile Transformer produced perfect accuracy on Iris and Wine but not on Digits or Breast Cancer, so it is a reasonable option rather than a universally best choice.

The selection of a feature scaling method should be guided by the dataset's characteristics and the specific requirements of the machine learning model in use. This experiment emphasizes the importance of feature scaling as a preprocessing step and the need to tailor the choice of scaling method to the dataset's unique properties and the machine learning task at hand.