# Preprocessing: feature scaling

## Previous steps

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

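housing = pd.read_csv("./data/housing.csv")

# Stratify the split by income category so that both sets keep the same income distribution
income_cat = pd.cut(housing["median_income"],
                    bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                    labels=[1, 2, 3, 4, 5])
train_set, test_set = train_test_split(
    housing, test_size=0.2, stratify=income_cat, random_state=42)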

X_train = train_set.drop("median_house_value", axis=1)
y_train = train_set["median_house_value"].copy()
X_train_num = X_train.select_dtypes(include=[np.number])

## Scaling, normalization and standardization

Most Machine Learning algorithms don't work well when features have very different scales. For example, many classifiers compute the Euclidean distance between two points; if one feature has much larger values than the others, it will dominate the distance. In our dataset, 'median_income' ranges from 0 to 15, while 'total_rooms' ranges from 6 to 39,320.
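
To make the effect concrete, here is a minimal sketch with three made-up districts (the points `a`, `b`, `c` are hypothetical, and dividing by the range is only a rough stand-in for a proper scaler). It compares Euclidean distances before and after putting both features on a comparable scale.

```python
import numpy as np

# Hypothetical districts, columns: [median_income, total_rooms]
a = np.array([2.0, 1000.0])
b = np.array([8.0, 1010.0])  # very different income, almost the same room count
c = np.array([2.1, 1500.0])  # almost the same income, different room count

# Raw Euclidean distances: total_rooms dominates the result
dist_ab = np.linalg.norm(a - b)
dist_ac = np.linalg.norm(a - c)

# Divide each feature by its approximate range (values quoted in the text above)
ranges = np.array([15.0, 39320.0 - 6.0])
scaled_ab = np.linalg.norm((a - b) / ranges)
scaled_ac = np.linalg.norm((a - c) / ranges)

print(f"raw:    d(a, b) = {dist_ab:.2f}, d(a, c) = {dist_ac:.2f}")
print(f"scaled: d(a, b) = {scaled_ab:.4f}, d(a, c) = {scaled_ac:.4f}")
```

On the raw values, `a` looks much closer to `b` than to `c`, even though their incomes differ widely. Once each feature is divided by its range, the ordering flips.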

To avoid this, it's common to scale the features.

The terminology can be confusing at this point. In general, **normalization** refers to changing the scale of data to fit a specific range, while **standardization** refers to changing the distribution of data to have a mean of 0 and a standard deviation of 1. In both cases, they are linear transformations that don't change the shape of the data distribution. In statistics there's usually a clear distinction between both terms, but in deep learning and computer vision, the terminology can be less consistent and **it's common to use "normalization" to refer to standardization**.

In [None]:
X_train.describe().T

## MinMaxScaler

The most common form of normalization is **min-max normalization**, also called **min-max scaling**. It is also the simplest: values are shifted and rescaled so that they end up between a chosen minimum and maximum. The usual range is 0 to 1, although other ranges are possible (neural networks usually work better with *inputs* with mean 0, so the range -1 to 1 is sometimes used). Scikit-Learn provides a `MinMaxScaler` class for this, and its `feature_range` parameter sets the target range. For the default range of 0 to 1:

$$ X_{norm} = \frac{X - X_{min}}{X_{max} - X_{min}} $$

In [None]:
from sklearn.preprocessing import MinMaxScaler

min_max_scaler = MinMaxScaler(feature_range=(-1, 1))
housing_num_min_max_scaled = min_max_scaler.fit_transform(X_train_num)
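
As a rough check of how `feature_range` relates to the formula above, the sketch below rescales a small made-up column by hand (first to 0 to 1, then stretched to the target range) and compares it with `MinMaxScaler`. The array `X` and the names `lo`, `hi` are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[1.0], [3.0], [5.0], [9.0]])  # toy data, a single feature
lo, hi = -1.0, 1.0                           # target range

# Step 1: min-max formula, giving values in [0, 1]
X_01 = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
# Step 2: stretch and shift to [lo, hi]
manual = X_01 * (hi - lo) + lo

sk_scaled = MinMaxScaler(feature_range=(lo, hi)).fit_transform(X)

print(manual.ravel())
print(np.allclose(manual, sk_scaled))
```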

Min-max normalization is very sensitive to *outliers*, since a single very large value can completely change the scale of the data. If all the data lies between 20 and 30 but a single value of 100 appears, the maximum becomes 100 and all other values are squeezed into the bottom of the range (0 to 0.125 when scaling to 0 to 1). In general, min-max normalization should only be used if we're sure that the *outliers* are not errors. Techniques such as capping or Winsorizing can be applied beforehand to limit the impact of extreme values (see [Outliers and capped values](e2e020_eda.ipynb#Outliers-and-capped-values)).
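
The 20-to-30 example can be reproduced with a few made-up values. This is only a sketch: the specific numbers are chosen for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

values = np.array([20.0, 22.0, 25.0, 28.0, 30.0]).reshape(-1, 1)
with_outlier = np.vstack([values, [[100.0]]])  # same data plus a single extreme value

scaled_clean = MinMaxScaler().fit_transform(values)
scaled_outlier = MinMaxScaler().fit_transform(with_outlier)

# Without the outlier, the five values spread over the whole [0, 1] range
print(scaled_clean.ravel())
# With it, the same five values are compressed near the bottom of the range
print(scaled_outlier.ravel())
```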

## StandardScaler

**Z-score standardization** (also called the ***standard score***) works differently: it first subtracts the mean (so the result has mean 0) and then divides by the **standard deviation** (so the result has standard deviation 1). Unlike min-max scaling, standardization doesn't confine values to a specific range, but it is much less affected by outliers. Scikit-Learn provides a `StandardScaler` class for this.

$$ X_{std} = \frac{X - \mu}{\sigma} $$

In [None]:
from sklearn.preprocessing import StandardScaler

std_scaler = StandardScaler()
housing_num_std_scaled = std_scaler.fit_transform(X_train_num)
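
To connect the formula with the class, the following sketch standardizes a small made-up matrix by hand and compares the result with `StandardScaler`. The toy array `X` is an assumption. Note that the comparison relies on the population standard deviation (`ddof=0`), which is NumPy's default.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0],
              [4.0, 700.0]])  # toy data, two features on different scales

# Column-wise z-score with the population standard deviation (ddof=0)
manual = (X - X.mean(axis=0)) / X.std(axis=0)

scaler = StandardScaler()
sk_scaled = scaler.fit_transform(X)

print(scaler.mean_, scaler.scale_)      # learned mean and standard deviation per column
print(np.allclose(manual, sk_scaled))
```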

Many ML models work better with standardized input *features*, so standardizing them is common practice in most cases (tree-based models are the main exception). Scaling the *target* is less common, but can be useful in some cases, particularly for gradient-based models (such as neural networks) or distance-based models (such as KNN or SVM regressors).

For example, we could apply StandardScaler again to the labels:

In [None]:
target_scaler = StandardScaler()
scaled_labels = target_scaler.fit_transform(y_train.to_frame()) # convert the target to a dataframe (fit_transform expects 2D)
print(type(y_train)) # Since it's a single column, the labels were previously stored in a Series object
scaled_labels

## Scaling target variables and subsequent inversion

If we transform the target variable in any way, our model's output will also return transformed predictions. If we want the predictions to be on the original scale, we'll need to invert the transformation. Many of Scikit-Learn's transformers have an `inverse_transform()` method, which makes it easy to calculate the inverse of their transformations.

> **Note:** Not all transformations are invertible. For example:
> - **Numerical scalers** (`StandardScaler`, `MinMaxScaler`): Always invertible, because they store the parameters needed to reverse the transformation.
> - **`OneHotEncoder`**: Invertible via `inverse_transform()` for categories seen during fitting, but categories that were unknown at fit time (encoded as all zeros when `handle_unknown="ignore"`) cannot be recovered.
> - **`OrdinalEncoder`**: Invertible if categories are preserved.
> - **Information-lossy transformations**: Some transformations like quantile-based discretization or heavy rounding cannot be perfectly inverted.
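
A small sketch of both situations, using made-up values (the target numbers and the category strings are purely illustrative): a scaler round trip that recovers the original data, and a one-hot encoding of an unseen category that cannot be recovered.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# 1) Numerical scaler: the round trip recovers the original values
y = np.array([[100.0], [200.0], [400.0]])
scaler = StandardScaler()
y_back = scaler.inverse_transform(scaler.fit_transform(y))
print(np.allclose(y, y_back))

# 2) OneHotEncoder: a category not seen during fit is encoded as all zeros...
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(np.array([["INLAND"], ["NEAR BAY"], ["INLAND"]]))
encoded = encoder.transform(np.array([["ISLAND"]])).toarray()
print(encoded)

# ...so inverse_transform has nothing to map it back to and returns None
restored = encoder.inverse_transform(encoded)
print(restored)
```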

To give a simplified example, we're going to train a **simple linear regression with the most correlated predictor** (`median_income`) and the labels we just scaled. Then we'll make predictions for a few sample rows and undo the transformation.

In [None]:
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train[["median_income"]], scaled_labels) # train on the scaled target; the input feature itself is left unscaled

some_new_data = X_train[["median_income"]].iloc[:5]  # for simplicity, we simulate new inputs to predict by taking 5 rows (we haven't preprocessed the test set)

scaled_predictions = model.predict(some_new_data)
target_scaler.inverse_transform(scaled_predictions) # Undo the transformation to get predictions on the original scale

This process can be simplified using the `TransformedTargetRegressor` class from Scikit-Learn, which allows training a model with transformed labels and undoing the transformation automatically.

In [None]:
from sklearn.compose import TransformedTargetRegressor

model = TransformedTargetRegressor(regressor = LinearRegression(),
                                   transformer = StandardScaler()) # dependent variable transformer
model.fit(X_train[["median_income"]], y_train)
model.predict(some_new_data)
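
To check that the two approaches agree, the sketch below repeats both on synthetic data (the data-generating line is an arbitrary assumption, not the housing dataset) and compares the predictions.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.uniform(0, 15, size=(100, 1))                            # synthetic feature
y = 40_000 * X[:, 0] + rng.normal(0, 10_000, size=100)           # synthetic target

# Manual route: scale the target, fit, predict, then invert the scaling
y_scaler = StandardScaler()
y_scaled = y_scaler.fit_transform(y.reshape(-1, 1))
manual_model = LinearRegression().fit(X, y_scaled)
manual_pred = y_scaler.inverse_transform(manual_model.predict(X[:5])).ravel()

# Wrapper route: TransformedTargetRegressor handles both directions
wrapped = TransformedTargetRegressor(regressor=LinearRegression(),
                                     transformer=StandardScaler())
wrapped.fit(X, y)
wrapped_pred = wrapped.predict(X[:5])

print(np.allclose(manual_pred, wrapped_pred))
```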

## Heavy-tailed distributions

A feature has a **heavy-tailed distribution** when values far from the mean are not exponentially rare. In the cases discussed here, the heavy tail is on the right (a right-skewed or positively skewed distribution): a few data points take on extremely large values compared to the majority. Common examples include income, population counts, and house prices.

> **Note on Skewness**: While distributions can be skewed to the left (negative skew) or right (positive skew), **right-skewed distributions are far more common in real-world datasets**. This is because many variables (like prices, counts, or physical measurements) have a natural lower bound of zero but no upper bound, allowing extreme values only in the positive direction. Left-skewed distributions (e.g., age at death, test scores) are less frequent in this context and are usually handled differently (e.g., squaring or exponential transformation).

### Why heavy tails are problematic for ML models

1. **Gradient-based optimization issues**: Models trained with gradient descent (for example, neural networks, or linear and logistic regression fitted with SGD) compute gradients that scale with the magnitude of the feature values. Extreme values produce disproportionately large gradients, causing unstable updates and slow or erratic convergence.

2. **Distance-based distortion**: Algorithms that rely on distance metrics (KNN, K-Means, SVM with RBF kernel) are severely affected. A single feature with extreme values can dominate the distance calculation, making other features effectively irrelevant.

3. **Coefficient instability**: In linear models, features with large ranges lead to very small coefficients that are numerically unstable and harder to interpret.

4. **Outlier sensitivity**: Heavy tails inherently contain values that behave like outliers, pulling decision boundaries or regression lines away from the bulk of the data.

### Which models are most affected?

| Model Type | Sensitivity | Reason |
|------------|-------------|--------|
| **Linear/Logistic Regression** | High | Gradient magnitude, coefficient scale |
| **Neural Networks** | High | Gradient-based training, activation saturation |
| **KNN, K-Means** | High | Distance calculations dominated by large values |
| **SVM** | High | Distance-based kernel computations |
| **Tree-based (Random Forest, XGBoost)** | Low | Splits based on rank/threshold, not magnitude |
| **Naive Bayes** | Moderate | Depends on distribution assumptions |

### The logarithmic transformation

The logarithmic transformation compresses the long tail by mapping large values closer together while spreading out smaller values. This makes the distribution more symmetric (closer to normal), which:

- Reduces the influence of extreme values
- Stabilizes variance across the range
- Improves gradient behavior during optimization
- Makes distance calculations more meaningful

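**Important**: After applying the logarithm, standard scaling is still recommended to ensure all features are on a comparable scale. Note also that the logarithm is only defined for strictly positive values; for features that can contain zeros, `np.log1p` (which computes log(1 + x)) is a common alternative.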
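
One way to quantify the gain in symmetry is the sample skewness. The sketch below uses synthetic log-normal data as a stand-in for a right-skewed feature (the distribution parameters are arbitrary assumptions) and compares the skewness before and after taking the logarithm.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
values = pd.Series(rng.lognormal(mean=7.0, sigma=1.0, size=10_000))  # synthetic, right-skewed

skew_raw = values.skew()           # strongly positive for a long right tail
skew_log = np.log(values).skew()   # close to 0 once the tail is compressed

print(f"skewness before log: {skew_raw:.2f}")
print(f"skewness after log:  {skew_log:.2f}")
```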

In [None]:
from matplotlib import pyplot as plt
import numpy as np
from sklearn.preprocessing import StandardScaler

fig, axs = plt.subplots(2, 2, figsize=(10, 8))

# Original distribution
X_train["population"].hist(ax=axs[0, 0], bins=50)
axs[0, 0].set_xlabel("Population")
axs[0, 0].set_ylabel("Number of districts")

# Log transformation
log_pop = np.log(X_train["population"])  # vectorized call; assumes strictly positive values
log_pop.hist(ax=axs[0, 1], bins=50)
axs[0, 1].set_xlabel("Log of population")
axs[0, 1].set_ylabel("Number of districts")

# Standard scaling
scaler = StandardScaler()
scaled_pop = scaler.fit_transform(X_train["population"].values.reshape(-1, 1)).flatten()
axs[1, 0].hist(scaled_pop, bins=50)
axs[1, 0].set_xlabel("Scaled population")
axs[1, 0].set_ylabel("Number of districts")

# Log transformation + standard scaling
scaled_log_pop = scaler.fit_transform(log_pop.values.reshape(-1, 1)).flatten()
axs[1, 1].hist(scaled_log_pop, bins=50)
axs[1, 1].set_xlabel("Scaled log of population")
axs[1, 1].set_ylabel("Number of districts")

plt.tight_layout()
plt.show()
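
The two steps (logarithm, then standard scaling) can also be chained in a single Scikit-Learn pipeline. The following is a minimal sketch on synthetic data standing in for the population column (the `population` array is made up). Passing `inverse_func=np.exp` is optional and is included here only so that the whole pipeline can be inverted.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

rng = np.random.default_rng(0)
population = rng.lognormal(mean=7.0, sigma=0.8, size=(1_000, 1))  # synthetic stand-in

log_then_scale = make_pipeline(
    FunctionTransformer(np.log, inverse_func=np.exp),  # compress the right tail
    StandardScaler(),                                  # then bring to mean 0, std 1
)

transformed = log_then_scale.fit_transform(population)
restored = log_then_scale.inverse_transform(transformed)

print(transformed.mean(), transformed.std())
print(np.allclose(restored, population))
```

Keeping both steps in one object means the same fitted transformation can later be applied to new data with `transform()`, without refitting.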