Feature 0 (median income in a block) and feature 5 (average house occupancy) of the California housing dataset have very different scales and contain some very large outliers. These two characteristics make the data hard to visualize and, more importantly, can degrade the predictive performance of many machine learning algorithms. Unscaled data can also slow down or even prevent the convergence of many gradient-based estimators.

Indeed, many estimators are designed with the assumption that each feature takes values close to zero or, more importantly, that all features vary on comparable scales. In particular, metric-based and gradient-based estimators often assume approximately standardized data (centered features with unit variances). Decision tree-based estimators are a notable exception: they are robust to arbitrary scaling of the data.
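To see why scale matters for a metric-based method, consider the following minimal sketch. The five toy rows are made up for illustration and are not taken from the dataset: the first column varies over units and the second over thousands. The helper `nearest_to_first_row` is a hypothetical name introduced here.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: column 0 varies over units, column 1 over thousands.
X_toy = np.array([
    [1.0, 1000.0],
    [9.0, 1010.0],
    [1.2, 1100.0],
    [5.0, 2000.0],
    [8.0, 3000.0],
])


def nearest_to_first_row(X):
    """Return the index of the row closest (Euclidean) to row 0, excluding row 0."""
    distances = np.linalg.norm(X[1:] - X[0], axis=1)
    return int(np.argmin(distances)) + 1


raw_nn = nearest_to_first_row(X_toy)
scaled_nn = nearest_to_first_row(StandardScaler().fit_transform(X_toy))

print("nearest neighbour on raw data:  ", raw_nn)
print("nearest neighbour after scaling:", scaled_nn)
```

On the raw data the second column dominates the distance, so row 1 is the closest even though its first feature is very different from that of row 0. After standardization both columns contribute comparably and row 2 becomes the nearest neighbour.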

This example uses different scalers, transformers, and normalizers to bring the data within a pre-defined range.

Scalers are linear (or more precisely affine) transformers and differ from each other in the way they estimate the parameters used to shift and scale each feature.
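The affine form can be checked directly. The sketch below, run on synthetic data (an assumption made for illustration), reproduces three scikit-learn scalers by hand as `(X - shift) / scale`. MinMaxScaler and RobustScaler are not used elsewhere in this notebook; they appear here only to show how the shift and scale estimates differ.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

rng = np.random.default_rng(0)
# Synthetic, skewed data standing in for real features (assumption).
X_demo = rng.lognormal(mean=0.0, sigma=1.0, size=(200, 2))

# StandardScaler: shift by the mean, scale by the standard deviation.
manual_standard = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)
sk_standard = StandardScaler().fit_transform(X_demo)

# MinMaxScaler: shift by the minimum, scale by the range.
data_range = X_demo.max(axis=0) - X_demo.min(axis=0)
manual_minmax = (X_demo - X_demo.min(axis=0)) / data_range
sk_minmax = MinMaxScaler().fit_transform(X_demo)

# RobustScaler: shift by the median, scale by the interquartile range.
q25, q75 = np.percentile(X_demo, [25, 75], axis=0)
manual_robust = (X_demo - np.median(X_demo, axis=0)) / (q75 - q25)
sk_robust = RobustScaler().fit_transform(X_demo)

print("StandardScaler matches:", np.allclose(sk_standard, manual_standard))
print("MinMaxScaler matches:  ", np.allclose(sk_minmax, manual_minmax))
print("RobustScaler matches:  ", np.allclose(sk_robust, manual_robust))
```

Because the median and interquartile range are less sensitive to extreme values than the mean, standard deviation, minimum, and maximum, the choice of scaler mostly determines how much influence outliers have on the estimated shift and scale.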

QuantileTransformer provides non-linear transformations in which distances between marginal outliers and inliers are shrunk. PowerTransformer provides non-linear transformations in which data is mapped to a normal distribution to stabilize variance and minimize skewness.
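As a rough illustration of how a quantile transform shrinks the distance between an outlier and the inliers, the sketch below uses a made-up one-dimensional feature (99 evenly spaced values plus one extreme value; these numbers are assumptions, not dataset values). A uniform output distribution is used so that the result is easy to read.

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

# Hypothetical 1-D feature: 99 evenly spaced inliers plus one extreme outlier.
x = np.concatenate([np.arange(99, dtype=float), [10_000.0]]).reshape(-1, 1)

# n_quantiles is set to the number of samples so every point is a landmark.
qt = QuantileTransformer(n_quantiles=100, output_distribution='uniform')
x_unif = qt.fit_transform(x)

gap_before = x[-1, 0] - x[-2, 0]            # outlier vs. largest inlier, raw
gap_after = x_unif[-1, 0] - x_unif[-2, 0]   # same pair after the transform

print("raw gap:        ", gap_before)
print("transformed gap:", gap_after)
```

Since the mapping depends on the rank of each value rather than its magnitude, the outlier ends up about as close to its neighbour as any two adjacent inliers are to each other.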

Unlike the previous transformations, normalization refers to a per-sample transformation instead of a per-feature transformation.
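The per-sample behaviour can be sketched with scikit-learn's `Normalizer`, which rescales each row to unit norm independently of the other rows. The three rows below are made-up values chosen so the arithmetic is easy to follow.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Hypothetical samples; each row is rescaled on its own.
X_rows = np.array([
    [3.0, 4.0],
    [1.0, 0.0],
    [10.0, 10.0],
])

X_unit = Normalizer(norm='l2').fit_transform(X_rows)
row_norms = np.linalg.norm(X_unit, axis=1)

print(X_unit)
print("row norms:", row_norms)
```

Every row of the result has Euclidean norm 1, whereas the columns keep no particular mean or variance. This is the opposite of what the scalers above do.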

In [None]:
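import os
import sys

import numpy as np
import plotly.figure_factory as ff
import plotly.graph_objects as go

# Add the directory five levels up to sys.path (presumably the project root,
# so that the local `erudition` package can be imported).
module_path = os.path.abspath(os.path.join('../../../../..'))
if module_path not in sys.path:
    sys.path.append(module_path)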

from erudition.learning.modules.sklearn.GeneralizedLinearModels.helper import helper
from erudition.learning.helpers.plots.plotly_render import render, scatter

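from sklearn.datasets import fetch_california_housing

In [None]:
# Load the California housing dataset described above
# (load_boston has been removed from recent scikit-learn releases).
dataset = fetch_california_housing()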

X_full, y_full = dataset.data, dataset.target

In [None]:
# Take only 2 features to make visualization easier.
# Feature 0 has a long-tailed distribution.
# Feature 5 has a few but very large outliers.

X = X_full[:, [0, 5]]
X.shape

In [None]:
def plot(X):
    """Render a 2D density plot of the two columns of X with marginal histograms."""
    colorscale = ['#AAAAAA', '#FFFFFF', '#FFFFFF', (1, 1, 0.2), (0.98, 0.98, 0.98)]

    fig = ff.create_2d_density(
        X[:, 0], X[:, 1],
        point_size=3,
        colorscale='Greys',
        hist_color=colorscale,
    )

    render(fig, title='Distribution', width=900)

In [None]:
plot(X)

# StandardScaler

In [None]:
from sklearn.preprocessing import StandardScaler

# Center each feature to zero mean and scale it to unit variance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

In [None]:
plot(X_scaled)

# QuantileTransformer

In [None]:
from sklearn.preprocessing import QuantileTransformer

trans = QuantileTransformer(output_distribution='normal')
X_trans = trans.fit_transform(X)

In [None]:
plot(X_trans)

# PowerTransformer

Apply a power transform featurewise to make data more Gaussian-like.

Power transforms are a family of parametric, monotonic transformations that are applied to make data more Gaussian-like. This is useful for modeling issues related to heteroscedasticity (non-constant variance), or other situations where normality is desired.

Currently, PowerTransformer supports the Box-Cox transform and the Yeo-Johnson transform. The optimal parameter for stabilizing variance and minimizing skewness is estimated through maximum likelihood.

Box-Cox requires input data to be strictly positive, while Yeo-Johnson supports both positive and negative data.

By default, zero-mean, unit-variance normalization is applied to the transformed data.
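The difference between the two methods, and the default standardization of the output, can be sketched as follows. The data is synthetic (log-normal samples shifted so that some values are negative), which is an assumption made purely for illustration.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
# Synthetic right-skewed, strictly positive data (assumption).
X_pos = rng.lognormal(mean=0.0, sigma=1.0, size=(500, 1))
# Shifting the data introduces negative values.
X_mixed = X_pos - 1.0

# Yeo-Johnson accepts data of any sign.
yj = PowerTransformer(method='yeo-johnson')
X_yj = yj.fit_transform(X_mixed)
print("estimated lambda:", yj.lambdas_)
print("mean and std after transform:", X_yj.mean(), X_yj.std())

# Box-Cox rejects data that is not strictly positive.
try:
    PowerTransformer(method='box-cox').fit(X_mixed)
    box_cox_failed = False
except ValueError as err:
    box_cox_failed = True
    print("Box-Cox error:", err)
```

The fitted `lambdas_` attribute holds one estimated power parameter per feature. With the default `standardize=True`, the transformed output has approximately zero mean and unit variance.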

In [None]:
from sklearn.preprocessing import PowerTransformer

trans = PowerTransformer(method='box-cox')
X_trans = trans.fit_transform(X)

In [None]:
plot(X_trans)