# Data Normalization

```text
- Normalization (feature scaling) is a preprocessing step in machine learning.
- It brings the values of the numeric columns in a dataset onto a common scale.
- Every feature is then treated comparably, so none dominates only because of its units or range.
- It also helps keep the weights numerically stable during training.
```

<br>

## Standardization

```text
- The transformed data has a mean of 0 and a standard deviation of 1.
- It's mostly used when the data is approximately normally distributed.
```

$$x_{scaled} = \frac{x_{i} - \bar{x}}{\sigma}$$

where
$\bar{x}$ is the mean and
$\sigma$ is the standard deviation.
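
The formula above can be sketched in NumPy as follows. This is a minimal illustration that assumes a 2-D array with one feature per column; the function name `standardize` and the sample values are made up for the example and are not part of the notebook's `Standardizer` class.

```python
import numpy as np


def standardize(X: np.ndarray) -> np.ndarray:
    """Column-wise z-score: subtract the mean, divide by the standard deviation."""
    mean = X.mean(axis=0)
    # ddof=0 is the population standard deviation (an assumption of this sketch).
    std = X.std(axis=0, ddof=0)
    return (X - mean) / std


# Illustrative data: two features on very different scales.
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])
X_scaled = standardize(X)

print(X_scaled.mean(axis=0))  # approximately 0 for every column
print(X_scaled.std(axis=0))   # approximately 1 for every column
```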

<br>

## Normalization

```text
- The transformed data has a minimum of 0 and a maximum of 1.
- It's used mostly for images and uniformly distributed data.
```

$$x_{scaled} = \frac{x_{i} - x_{min}}{x_{max} - x_{min}}$$

where
$x_{min}$ and $x_{max}$ are the `minimum` and `maximum` values respectively.
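
A minimal NumPy sketch of this min-max formula, again assuming one feature per column. The helper name `min_max_scale` and the sample array are illustrative only.

```python
import numpy as np


def min_max_scale(X: np.ndarray) -> np.ndarray:
    """Column-wise min-max scaling to the range [0, 1]."""
    x_min = X.min(axis=0)
    x_max = X.max(axis=0)
    # Caveat: a constant column has x_max == x_min and would divide by zero.
    return (X - x_min) / (x_max - x_min)


# Illustrative data: two features on different scales.
X = np.array([[1.0, 10.0], [2.0, 20.0], [5.0, 60.0]])
X_scaled = min_max_scale(X)

print(X_scaled.min(axis=0))  # 0 for every column
print(X_scaled.max(axis=0))  # 1 for every column
```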


<br><hr>

### Custom Normalization

```text
- The data can also be normalized to any target range [a, b] instead of [0, 1].
```
$$x_{scaled} = \frac{x_{i} - x_{min}}{x_{max} - x_{min}}$$
$$x^* = a + x_{scaled}(b - a)$$

where
$x_{scaled}$ has a value between 0 and 1 \
$x^*$ has a `minimum` and `maximum` value of $a$ and $b$ respectively.
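
The two steps can be combined into one small function. The sketch below is only an illustration: the name `scale_to_range`, its parameters `a` and `b`, and the input values are assumptions made for the example.

```python
import numpy as np


def scale_to_range(X: np.ndarray, a: float = 0.0, b: float = 1.0) -> np.ndarray:
    """Min-max scale each column to [0, 1], then stretch it to [a, b]."""
    x_min = X.min(axis=0)
    x_max = X.max(axis=0)
    x_scaled = (X - x_min) / (x_max - x_min)  # step 1: values in [0, 1]
    return a + x_scaled * (b - a)             # step 2: values in [a, b]


# Illustrative single-feature data whose minimum is 0 and maximum is 1.
x = np.array([[0.0], [0.04], [0.56], [1.0]])
x_star = scale_to_range(x, a=2.0, b=5.0)

print(x_star.min(), x_star.max())  # the new minimum and maximum: a and b
```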

In [1]:
# Built-in library
from typing import Any, Optional, Sequence, Union

# Standard imports
import numpy as np
import numpy.typing as npt
import pandas as pd
from sklearn.model_selection import train_test_split
import torch
from torch import nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset
import matplotlib.pyplot as plt

# Configure the backend
import matplotlib_inline.backend_inline

matplotlib_inline.backend_inline.set_matplotlib_formats("svg")
import seaborn as sns

# Custom import
from src.utilities import create_iris_data, smooth, load_data
from src.preprocessor import Standardizer


# Black code formatter (Optional)
%load_ext lab_black
# auto reload imports
%load_ext autoreload
%autoreload 2

In [2]:
fp = "../../data/wine_data.csv"

data = load_data(filename=fp)
data.head()

Shape of data: (1599, 12)



Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5


#### Standardization

In [8]:
standardizer = Standardizer()
data_transf = standardizer.fit_transform(data.iloc[:, :-1])
data_transf.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol
0,-0.52836,0.961877,-1.391472,-0.453218,-0.243707,-0.466193,-0.379133,0.558274,1.288643,-0.579207,-0.960246
1,-0.298547,1.967442,-1.391472,0.043416,0.223875,0.872638,0.624363,0.028261,-0.719933,0.12895,-0.584777
2,-0.298547,1.297065,-1.18607,-0.169427,0.096353,-0.083669,0.229047,0.134264,-0.331177,-0.048089,-0.584777
3,1.654856,-1.384443,1.484154,-0.453218,-0.26496,0.107592,0.4115,0.664277,-0.979104,-0.46118,-0.584777
4,-0.52836,0.961877,-1.391472,-0.453218,-0.243707,-0.466193,-0.379133,0.558274,1.288643,-0.579207,-0.960246


In [10]:
data_transf.describe().loc[["mean", "std"]]

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol
mean,3.554936e-16,1.733031e-16,-8.887339e-17,-1.244227e-16,3.732682e-16,-6.221137e-17,4.443669e-17,-3.473172e-14,2.861723e-15,6.754377e-16,1.066481e-16
std,1.000313,1.000313,1.000313,1.000313,1.000313,1.000313,1.000313,1.000313,1.000313,1.000313,1.000313
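
The `std` row reads 1.000313 rather than exactly 1. This is consistent with the scaler dividing by the population standard deviation (`ddof=0`) while `describe()` reports the sample standard deviation (`ddof=1`): with 1599 rows the ratio is $\sqrt{1599/1598} \approx 1.000313$. The internals of `Standardizer` are not shown in this notebook, so treat this as a likely explanation rather than a confirmed one. The sketch below reproduces the effect on synthetic data (the random values are made up for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows = 1599  # same number of rows as the wine data
x = rng.normal(loc=5.0, scale=2.0, size=n_rows)  # synthetic feature

# Standardize with the population standard deviation (ddof=0).
z = (x - x.mean()) / x.std(ddof=0)

pop_std = float(z.std(ddof=0))     # what this scaling targets
sample_std = float(z.std(ddof=1))  # the ddof=1 estimate that describe() uses
expected_ratio = float(np.sqrt(n_rows / (n_rows - 1)))

print(round(pop_std, 6), round(sample_std, 6), round(expected_ratio, 6))
```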


#### Normalization

In [3]:
# `np.float_` was removed in NumPy 2.0, so the annotations use `np.float64`.
FloatArray = npt.NDArray[np.float64]
Features = Union[pd.DataFrame, FloatArray]


class Normalizer:
    """Min-max scaler: maps every column to [min_value, max_value].

    By default the result has a minimum of 0 and a maximum of 1."""

    def __init__(self, min_value: float = 0, max_value: float = 1) -> None:
        self.min_value = min_value
        self.max_value = max_value
        # Per-column minimum and maximum, learned by `fit`.
        self._min: Optional[FloatArray] = None
        self._max: Optional[FloatArray] = None

    def __repr__(self) -> str:
        return (
            f"{self.__class__.__name__}(min_value={self.min_value}, "
            f"max_value={self.max_value})"
        )

    @staticmethod
    def _normalize(X: Features, min_: FloatArray, max_: FloatArray) -> Features:
        """Scale each column to [0, 1].

        Note: a constant column (max_ == min_) leads to a division by zero."""
        return (X - min_) / (max_ - min_)

    def _custom_normalize(
        self, X: Features, min_: FloatArray, max_: FloatArray
    ) -> Features:
        """Scale each column to [0, 1], then stretch it to [min_value, max_value]."""
        x_scaled = self._normalize(X=X, min_=min_, max_=max_)
        return self.min_value + x_scaled * (self.max_value - self.min_value)

    def fit(self, X: Features, y=None) -> "Normalizer":
        """Learn the per-column minimum and maximum."""
        # np.asarray handles DataFrames and arrays alike; NaNs are ignored,
        # as pandas does by default.
        values = np.asarray(X, dtype=np.float64)
        self._min = np.nanmin(values, axis=0)
        self._max = np.nanmax(values, axis=0)
        return self

    def transform(self, X: Features, y=None) -> Features:
        """Apply the scaling learned by `fit`."""
        if self._min is None or self._max is None:
            raise RuntimeError("Call `fit` before `transform`.")
        return self._custom_normalize(X=X, min_=self._min, max_=self._max)

    def fit_transform(self, X: Features, y=None) -> Features:
        """Learn the parameters and apply the transformation."""
        return self.fit(X=X).transform(X=X)

In [4]:
normalizer = Normalizer(min_value=0, max_value=1)
data_transf = normalizer.fit_transform(data.iloc[:, :-1])
data_transf.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol
0,0.247788,0.39726,0.0,0.068493,0.106845,0.140845,0.09894,0.567548,0.606299,0.137725,0.153846
1,0.283186,0.520548,0.0,0.116438,0.143573,0.338028,0.215548,0.494126,0.362205,0.209581,0.215385
2,0.283186,0.438356,0.04,0.09589,0.133556,0.197183,0.169611,0.508811,0.409449,0.191617,0.215385
3,0.584071,0.109589,0.56,0.068493,0.105175,0.225352,0.190813,0.582232,0.330709,0.149701,0.215385
4,0.247788,0.39726,0.0,0.068493,0.106845,0.140845,0.09894,0.567548,0.606299,0.137725,0.153846


In [5]:
data_transf.describe().loc[["min", "max"]]

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
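
Keeping `fit` and `transform` as separate steps lets the minimum and maximum be learned from one split of the data and reused on another. A minimal sketch with made-up numbers (the arrays `X_train` and `X_test` are illustrative and not taken from the wine data); note that unseen values can land outside [0, 1]:

```python
import numpy as np

# Illustrative splits with a single feature.
X_train = np.array([[1.0], [3.0], [5.0]])
X_test = np.array([[0.0], [4.0], [6.0]])

# "fit": learn the parameters from the training split only.
x_min = X_train.min(axis=0)
x_max = X_train.max(axis=0)

# "transform": reuse the same parameters on both splits.
train_scaled = (X_train - x_min) / (x_max - x_min)
test_scaled = (X_test - x_min) / (x_max - x_min)

print(train_scaled.ravel())  # values 0.0, 0.5, 1.0
print(test_scaled.ravel())   # values -0.25, 0.75, 1.25 (outside [0, 1])
```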


#### Custom Normalizer

```text
- The output should have the specified min and max values.
```

In [6]:
normalizer2 = Normalizer(min_value=2, max_value=5)
data_transf = normalizer2.fit_transform(data.iloc[:, :-1])
data_transf.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol
0,2.743363,3.191781,2.0,2.205479,2.320534,2.422535,2.29682,3.702643,3.818898,2.413174,2.461538
1,2.849558,3.561644,2.0,2.349315,2.430718,3.014085,2.646643,3.482379,3.086614,2.628743,2.646154
2,2.849558,3.315068,2.12,2.287671,2.400668,2.591549,2.508834,3.526432,3.228346,2.57485,2.646154
3,3.752212,2.328767,3.68,2.205479,2.315526,2.676056,2.572438,3.746696,2.992126,2.449102,2.646154
4,2.743363,3.191781,2.0,2.205479,2.320534,2.422535,2.29682,3.702643,3.818898,2.413174,2.461538


In [7]:
data_transf.describe().loc[["min", "max"]]

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol
min,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0
max,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0
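
As a hand check, the `In [6]` values follow from the `In [4]` values through $x^* = a + x_{scaled}(b - a)$ with $a = 2$ and $b = 5$. For row 3 of `citric acid` ($x_{scaled} = 0.56$) and row 0 of `alcohol` ($x_{scaled} = 0.153846$):

$$x^* = 2 + 0.56\,(5 - 2) = 3.68$$
$$x^* = 2 + 0.153846\,(5 - 2) = 2.461538$$

Both results agree with the `In [6]` output.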
