---
title: "Comprehensive Study on the Impact of Feature Scaling on Classification Models"
author: "Sherry Thomas"
format:
  html:
    theme: theme.scss
    toc: true
    html-math-method: katex
---

## Introduction

Feature scaling is a crucial preprocessing step in machine learning that can significantly influence the performance of classification models. It transforms features to a common scale, so that no single feature dominates the learning process merely because of its range of values. This notebook explores the impact of several feature scaling methods on classification models. We will focus on five commonly used techniques:

1. Standard Scaler
2. Min-max Scaler
3. Maximum Absolute Scaler
4. Robust Scaler
5. Quantile Transformer

We will use the Wine dataset from scikit-learn, a frequently employed dataset for classification tasks, to demonstrate the effects of these scaling methods. It contains 178 wine samples described by 13 numeric features, each classified into one of three classes.

## Importing Necessary Libraries

Before we begin, we need to import the required libraries for data manipulation, visualization, and machine learning.

In [12]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler, MaxAbsScaler, RobustScaler, QuantileTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

## Loading the Wine Dataset

We start by loading the Wine dataset and inspecting its structure.

In [13]:
# Load the Wine dataset
wine = load_wine()

# Create a DataFrame for the dataset
wine_df = pd.DataFrame(data=np.c_[wine['data'], wine['target']], columns=wine['feature_names'] + ['target'])

# Display the first few rows of the dataset
wine_df.head()

|   | alcohol | malic_acid | ash | alcalinity_of_ash | magnesium | total_phenols | flavanoids | nonflavanoid_phenols | proanthocyanins | color_intensity | hue | od280/od315_of_diluted_wines | proline | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 14.23 | 1.71 | 2.43 | 15.6 | 127.0 | 2.8 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065.0 | 0.0 |
| 1 | 13.2 | 1.78 | 2.14 | 11.2 | 100.0 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.4 | 1050.0 | 0.0 |
| 2 | 13.16 | 2.36 | 2.67 | 18.6 | 101.0 | 2.8 | 3.24 | 0.3 | 2.81 | 5.68 | 1.03 | 3.17 | 1185.0 | 0.0 |
| 3 | 14.37 | 1.95 | 2.5 | 16.8 | 113.0 | 3.85 | 3.49 | 0.24 | 2.18 | 7.8 | 0.86 | 3.45 | 1480.0 | 0.0 |
| 4 | 13.24 | 2.59 | 2.87 | 21.0 | 118.0 | 2.8 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735.0 | 0.0 |


The Wine dataset consists of various features related to wine properties and a 'target' column indicating the class of the wine.
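
To see why scaling might matter for this dataset, it helps to compare the ranges of the features. The following is a minimal sketch, assuming a scikit-learn version that supports `as_frame=True` in `load_wine`; the names `wine_frame` and `feature_summary` are introduced here for illustration only.

```python
from sklearn.datasets import load_wine

# as_frame=True returns pandas objects; .frame holds the features plus the target column
wine_frame = load_wine(as_frame=True).frame

# Per-feature minimum, maximum, and standard deviation, one row per feature
feature_summary = wine_frame.drop(columns="target").agg(["min", "max", "std"]).T
print(feature_summary.sort_values("std", ascending=False))

# Number of samples in each of the three classes
print(wine_frame["target"].value_counts().sort_index())
```

Sorting by standard deviation puts the widest-ranging features at the top, which makes the differences in scale between features easy to spot.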

## Data Preprocessing

Before we proceed with feature scaling, we need to split the data into training and testing sets. Additionally, we create a noisy version of the Wine dataset by adding Gaussian noise (mean 0, standard deviation 0.2) to every feature value, and run the whole study on this noisy copy. Because the same noise level is applied to all features regardless of their range, it perturbs small-range features far more than large-range ones.

In [14]:
# Split the data into features (X) and target (y)
X = wine.data
y = wine.target

# Adding random noise to the dataset
np.random.seed(42)
noise = np.random.normal(0, 0.2, size=X.shape)
X_noisy = X + noise

# Split the noisy data into training and testing sets
X_train_noisy, X_test_noisy, y_train, y_test = train_test_split(X_noisy, y, test_size=0.2, random_state=42)
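
The effect of a fixed noise level depends on each feature's own spread. The sketch below, which is an illustration rather than part of the original experiment, compares the noise standard deviation of 0.2 with the standard deviation of each feature in the clean data; `relative_noise` is a name introduced here.

```python
from sklearn.datasets import load_wine

wine = load_wine()
noise_std = 0.2  # same standard deviation as the noise added above

# Ratio of the noise standard deviation to each feature's own standard deviation
feature_std = wine.data.std(axis=0)
relative_noise = dict(zip(wine.feature_names, noise_std / feature_std))

# Print features from most to least affected by the noise
for name, ratio in sorted(relative_noise.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:30s} {ratio:.4f}")
```

A ratio near or above 1 means the noise is as large as the feature's natural variation, while a ratio near 0 means the feature is left almost unchanged.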

## Feature Scaling Methods

### 1. Standard Scaler

The Standard Scaler ($SS$) transforms each feature so that it has a mean ($\mu$) of 0 and a standard deviation ($\sigma$) of 1. It does not require the data to be normally distributed, but because both $\mu$ and $\sigma$ are sensitive to extreme values, outliers can distort the result. The transformation is given by:

$$
SS(x) = \frac{x - \mu}{\sigma}
$$

where $x$ is the original feature vector, $\mu$ is the mean of the feature vector, and $\sigma$ is the standard deviation of the feature vector.

In [15]:
# Initialize the Standard Scaler
standard_scaler = StandardScaler()

# Fit and transform the training data
X_train_standard = standard_scaler.fit_transform(X_train_noisy)

# Transform the test data using the same scaler
X_test_standard = standard_scaler.transform(X_test_noisy)
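
As a sanity check on the formula, the standardization can be reproduced by hand on a small example. This is a minimal sketch on a hypothetical toy matrix (`X_toy` is not part of the Wine data), assuming the population standard deviation (`ddof=0`), which is what `StandardScaler` uses.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy feature matrix: 4 samples, 2 features on very different scales
X_toy = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0], [4.0, 400.0]])

# Manual standardization: subtract the column mean, divide by the column standard deviation
mu = X_toy.mean(axis=0)
sigma = X_toy.std(axis=0)  # population standard deviation (ddof=0)
X_manual = (X_toy - mu) / sigma

# The scikit-learn result should agree with the manual computation
X_sklearn = StandardScaler().fit_transform(X_toy)
print(np.allclose(X_manual, X_sklearn))
```

After scaling, both columns are identical even though the second was originally 100 times larger, which is exactly the effect the scaler is meant to have.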

### 2. Min-max Scaler

The Min-max Scaler ($MMS$) scales each feature to a fixed range, typically between 0 and 1. It makes no assumption about the shape of the distribution, but because it depends only on the two extreme values, it is sensitive to outliers. The transformation is given by:

$$
MMS(x) = \frac{x - x_{min}}{x_{max} - x_{min}}
$$

where $x$ is the original feature vector, $x_{min}$ is the smallest value in the feature vector, and $x_{max}$ is the largest value in the feature vector.

In [16]:
# Initialize the Min-max Scaler
min_max_scaler = MinMaxScaler()

# Fit and transform the training data
X_train_minmax = min_max_scaler.fit_transform(X_train_noisy)

# Transform the test data using the same scaler
X_test_minmax = min_max_scaler.transform(X_test_noisy)
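
One consequence of fitting on the training data only is that test values can fall outside the $[0, 1]$ range. The following sketch illustrates this on hypothetical toy arrays (`X_train_toy` and `X_test_toy` are made up for the example), assuming the default `MinMaxScaler` settings, which do not clip.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical one-feature toy data; the test set extends beyond the training range
X_train_toy = np.array([[1.0], [3.0], [5.0]])
X_test_toy = np.array([[0.0], [4.0], [6.0]])

# The minimum and maximum are learned from the training data only
scaler = MinMaxScaler().fit(X_train_toy)
x_min, x_max = X_train_toy.min(axis=0), X_train_toy.max(axis=0)

# Manual application of the formula with the training minimum and maximum
X_test_manual = (X_test_toy - x_min) / (x_max - x_min)
X_test_scaled = scaler.transform(X_test_toy)
print(X_test_scaled.ravel())  # -> [-0.25  0.75  1.25]
```

The values below 0 and above 1 are expected here: the scaler is applying the training range, not re-fitting to the test data.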

### 3. Maximum Absolute Scaler

The Maximum Absolute Scaler ($MAS$) divides each feature by its maximum absolute value, so the scaled values lie in $[-1, 1]$ and the largest absolute value in each feature equals 1. It does not shift/center the data, and thus does not destroy any sparsity. The transformation is given by:

$$
MAS(x) = \frac{x}{x_{max, abs}}
$$

where $x$ is the original feature vector, and $x_{max, abs} = \max_i |x_i|$ is the maximum absolute value in the feature vector.

In [17]:
# Initialize the Maximum Absolute Scaler
max_abs_scaler = MaxAbsScaler()

# Fit and transform the training data
X_train_maxabs = max_abs_scaler.fit_transform(X_train_noisy)

# Transform the test data using the same scaler
X_test_maxabs = max_abs_scaler.transform(X_test_noisy)
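
The difference between "largest value" and "largest absolute value" only shows up when a feature contains negative numbers. A minimal sketch on a hypothetical toy matrix (`X_toy` is made up for the example) illustrates this and also shows that zero entries stay zero:

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

# Hypothetical toy matrix: the first column's largest magnitude is a negative value
X_toy = np.array([[-4.0, 0.0], [2.0, 0.0], [1.0, 5.0]])

# Divide each column by its maximum absolute value
max_abs = np.abs(X_toy).max(axis=0)  # [4.0, 5.0]
X_manual = X_toy / max_abs

X_sklearn = MaxAbsScaler().fit_transform(X_toy)
print(X_sklearn)
```

The first column is divided by 4 (from the entry -4), not by its maximum of 2, so the result stays within $[-1, 1]$.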

### 4. Robust Scaler

The Robust Scaler ($RS$) scales the data using the median ($Q_2$) and the interquartile range ($IQR$, $Q_3 - Q_1$), making it robust to outliers. The transformation is given by:

$$
RS(x) = \frac{x - Q_2}{IQR}
$$

where $x$ is the original feature vector, $Q_2$ is the median of the feature vector, and $IQR$ is the interquartile range of the feature vector.

In [18]:
# Initialize the Robust Scaler
robust_scaler = RobustScaler()

# Fit and transform the training data
X_train_robust = robust_scaler.fit_transform(X_train_noisy)

# Transform the test data using the same scaler
X_test_robust = robust_scaler.transform(X_test_noisy)
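
To illustrate the robustness claim, the sketch below applies the formula by hand to a hypothetical toy feature with one extreme value (`X_toy` is made up for the example). It assumes the default `RobustScaler` settings, which use the 25th and 75th percentiles.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Hypothetical toy feature with one extreme outlier
X_toy = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# Median and interquartile range are barely affected by the outlier
median = np.median(X_toy, axis=0)                # 3.0
q1, q3 = np.percentile(X_toy, [25, 75], axis=0)  # 2.0 and 4.0
X_manual = (X_toy - median) / (q3 - q1)

X_sklearn = RobustScaler().fit_transform(X_toy)
print(X_sklearn.ravel())  # -> [-1.  -0.5  0.   0.5 48.5]
```

The four ordinary values keep an even spacing of 0.5 no matter how large the outlier is, because neither the median nor the IQR depends on it.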

### 5. Quantile Transformer

The Quantile Transformer ($QT$) applies a non-linear, monotonic transformation to each feature, mapping it to a uniform or normal distribution. This can be helpful when a feature is heavily skewed or contains outliers. It first uses the empirical cumulative distribution function (CDF) of the feature to map each value to $[0, 1]$, then applies the quantile function of the target distribution. The transformation is given by:

$$
QT(x) = G^{-1}(F(x))
$$

where $F$ is the empirical cumulative distribution function of the feature, and $G^{-1}$ is the inverse CDF (quantile function) of the target distribution. For a uniform target on $[0, 1]$, $G^{-1}$ is the identity, so $QT(x) = F(x)$.

In [None]:
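# Initialize the Quantile Transformer
# n_quantiles cannot exceed the number of training samples; capping it here
# matches what scikit-learn would do anyway and avoids a warning
quantile_transformer = QuantileTransformer(
    n_quantiles=min(1000, X_train_noisy.shape[0]),
    output_distribution='normal'
)

# Fit and transform the training data
X_train_quantile = quantile_transformer.fit_transform(X_train_noisy)

# Transform the test data using the same transformer
X_test_quantile = quantile_transformer.transform(X_test_noisy)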
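
The two stages of the transformation, an empirical CDF followed by the quantile function of the target distribution, can be checked on a small example. This is a sketch on a hypothetical toy feature with distinct values (`x_toy` is made up), assuming `n_quantiles` equals the number of samples so that the empirical CDF reduces to rescaled ranks; `scipy.stats.norm.ppf` plays the role of the standard normal quantile function.

```python
import numpy as np
from scipy.stats import norm, rankdata
from sklearn.preprocessing import QuantileTransformer

# Hypothetical skewed toy feature with distinct values
x_toy = np.array([1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0]).reshape(-1, 1)
n_samples = x_toy.shape[0]

# Empirical CDF: rank of each value rescaled to [0, 1]
F_x = (rankdata(x_toy.ravel()) - 1) / (n_samples - 1)

# Uniform output: the transformer returns the empirical CDF values
qt_uniform = QuantileTransformer(n_quantiles=n_samples, output_distribution="uniform")
x_uniform = qt_uniform.fit_transform(x_toy).ravel()
print(np.allclose(x_uniform, F_x))

# Normal output: the normal quantile function applied to the CDF values
qt_normal = QuantileTransformer(n_quantiles=n_samples, output_distribution="normal")
x_normal = qt_normal.fit_transform(x_toy).ravel()

# Compare interior points only; scikit-learn clips the two extremes
# (CDF values of exactly 0 and 1) to finite numbers
print(np.allclose(x_normal[1:-1], norm.ppf(F_x[1:-1])))
```

Although the input values double at every step, the output depends only on their ranks, which is why the transformation removes skew and dampens outliers.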

## Classification Models

We will now compare the performance of two classification models, Random Forest and Support Vector Machine (SVM), on the different scaled datasets. For each scaling method, we will train and evaluate both models.

In [20]:
# Initialize the Random Forest classifier
rf_classifier = RandomForestClassifier(random_state=42)

# Initialize the SVM classifier
svm_classifier = SVC(random_state=42)

# Lists to store accuracy scores
accuracy_scores = []

# Loop through each scaled dataset and evaluate the models
for X_train_scaled, X_test_scaled, scaler_name in zip(
    [X_train_noisy, X_train_standard, X_train_minmax, X_train_maxabs, X_train_robust, X_train_quantile],
    [X_test_noisy, X_test_standard, X_test_minmax, X_test_maxabs, X_test_robust, X_test_quantile],
    ["No Scaling", "Standard Scaler", "Min-max Scaler", "Maximum Absolute Scaler", "Robust Scaler", "Quantile Transformer"]
):
    # Train the Random Forest model
    rf_classifier.fit(X_train_scaled, y_train)
    rf_predictions = rf_classifier.predict(X_test_scaled)
    
    # Train the SVM model
    svm_classifier.fit(X_train_scaled, y_train)
    svm_predictions = svm_classifier.predict(X_test_scaled)
    
    # Calculate accuracy scores for both models
    rf_accuracy = accuracy_score(y_test, rf_predictions)
    svm_accuracy = accuracy_score(y_test, svm_predictions)
    
    # Store the accuracy scores for comparison
    accuracy_scores.append([scaler_name, rf_accuracy, svm_accuracy])

## Results and Discussion

Let's analyze the results of our experiment and discuss the impact of different scaling methods on classification models.

In [21]:
# Create a DataFrame to display the results
results_df = pd.DataFrame(accuracy_scores, columns=['Scaling Method', 'Random Forest Accuracy', 'SVM Accuracy'])
results_df

|   | Scaling Method | Random Forest Accuracy | SVM Accuracy |
|---|---|---|---|
| 0 | No Scaling | 1.0 | 0.805556 |
| 1 | Standard Scaler | 1.0 | 1.0 |
| 2 | Min-max Scaler | 1.0 | 1.0 |
| 3 | Maximum Absolute Scaler | 1.0 | 1.0 |
| 4 | Robust Scaler | 1.0 | 1.0 |
| 5 | Quantile Transformer | 1.0 | 1.0 |


## Evaluation of Results

The output from the notebook provides accuracy scores for two classification models, Random Forest and Support Vector Machine (SVM), using different feature scaling methods. Here's a summary of the results:

- **No Scaling**: Without any scaling, the Random Forest model achieved perfect accuracy (1.0), while the SVM model's accuracy was significantly lower (approximately 0.8056, or 29 of the 36 test samples). This disparity demonstrates the influence of feature scaling on SVM, which is sensitive to the range of feature values.

- **Standard Scaler**: The Standard Scaler yielded perfect accuracy (1.0) for both models. Standardization put all features on a comparable scale, which raised SVM accuracy from about 0.81 to 1.0. This result says nothing about whether the features are normally distributed, since standardization does not depend on that assumption.

- **Min-max Scaler**, **Maximum Absolute Scaler**, **Robust Scaler**, and **Quantile Transformer**: These methods also resulted in perfect accuracy (1.0) for both models. Whether the data is scaled to a specific range (Min-max Scaler and Maximum Absolute Scaler), scaled in a way that is robust to outliers (Robust Scaler), or mapped non-linearly to a uniform or normal distribution (Quantile Transformer), the SVM improved by the same amount on this test set. The Random Forest's performance remained consistently high regardless of the scaling method, which is consistent with its insensitivity to the scale of features: tree splits depend only on the ordering of values within each feature.
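
One way to make the SVM's sensitivity concrete is to look at how much each unscaled feature contributes to distances between samples. With an RBF kernel (the `SVC` default), similarity depends on squared Euclidean distance, and each feature's expected contribution to that distance is proportional to its variance. The following sketch, an illustration on the clean Wine data rather than part of the original experiment, computes each feature's share of the total variance; `variance_share` is a name introduced here.

```python
from sklearn.datasets import load_wine

wine = load_wine()

# Each feature's share of the total variance across all features
feature_variance = wine.data.var(axis=0)
variance_share = dict(zip(wine.feature_names, feature_variance / feature_variance.sum()))

# Show the three features that dominate unscaled distances
top_features = sorted(variance_share.items(), key=lambda item: item[1], reverse=True)[:3]
for name, share in top_features:
    print(f"{name:20s} {share:.4%}")
```

On the unscaled data a single feature, `proline`, accounts for almost all of the variance, so the kernel effectively ignores the other twelve features until they are rescaled.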

## Conclusion

In conclusion, this notebook studied the impact of feature scaling on two classification models. It demonstrates that whether features are scaled can significantly influence the performance of a model, especially for models like SVM that are sensitive to the range of feature values.

Without scaling, SVM reached an accuracy of only about 0.81. With any of the five scaling methods applied, it reached perfect accuracy on the test set. This highlights the importance of feature scaling in preprocessing, particularly when using models sensitive to the scale of input features.

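The Standard Scaler, Min-max Scaler, Maximum Absolute Scaler, Robust Scaler, and Quantile Transformer all proved to be effective for the noisy Wine dataset. Because every scaled configuration reached perfect accuracy on a single 36-sample test set, this experiment cannot rank the methods against each other. In general, the effectiveness of a scaling method can vary based on the characteristics of the dataset and the specific machine learning model being used.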

When working with real-world datasets, it's essential to experiment with different scaling techniques and select the one that best fits the data's distribution and the requirements of the machine learning model. This decision should be data-driven and based on a thorough understanding of the data's characteristics.
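
One possible way to run such a comparison is to wrap each scaler and the model in a scikit-learn pipeline and score it with cross-validation, so the scaler is re-fitted on the training portion of every fold. The sketch below is an illustration under stated assumptions, not the procedure used in this notebook: it uses the clean Wine data, a default `SVC`, 5-fold stratified cross-validation, and an arbitrary `n_quantiles=100` for the Quantile Transformer.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import (MaxAbsScaler, MinMaxScaler, QuantileTransformer,
                                   RobustScaler, StandardScaler)
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Candidate preprocessing steps; None means the SVM sees the raw features
candidates = {
    "No Scaling": None,
    "Standard Scaler": StandardScaler(),
    "Min-max Scaler": MinMaxScaler(),
    "Maximum Absolute Scaler": MaxAbsScaler(),
    "Robust Scaler": RobustScaler(),
    "Quantile Transformer": QuantileTransformer(
        n_quantiles=100, output_distribution="normal", random_state=42),
}

cv_results = {}
for name, scaler in candidates.items():
    # Inside a pipeline the scaler is fitted on each training fold only
    steps = [SVC()] if scaler is None else [scaler, SVC()]
    model = make_pipeline(*steps)
    scores = cross_val_score(model, X, y, cv=cv)
    cv_results[name] = scores.mean()
    print(f"{name:25s} mean accuracy = {scores.mean():.3f} (std {scores.std():.3f})")
```

Averaging over several folds gives a more stable estimate than a single train/test split, and the per-fold standard deviation indicates whether differences between scalers are larger than the fold-to-fold variation.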

This experiment underscores the importance of feature scaling as a preprocessing step and the need to consider the specific scaling method in the broader context of machine learning tasks.

## Reference

* [“The choice of scaling technique matters for classification performance”](https://arxiv.org/pdf/2212.12343.pdf) by Lucas B.V. de Amorim, George D.C. Cavalcanti, and Rafael M.O. Cruz