# Self-study try-it activity 8.1: Normalising data sets

### Data preprocessing: normalisation and scaling

In machine learning, scaling and normalisation are fundamental preprocessing techniques that put features on comparable scales so that no feature dominates the model simply because of its units. These steps adjust the range and distribution of feature values, which can improve model performance, speed up convergence for scale-sensitive algorithms and make model coefficients easier to compare.

Normalisation (min-max scaling) rescales each feature to a fixed range, here [0, 1], preserving the shape of the original distribution:
$$
x' = \frac{x - \min(x)}{\max(x) - \min(x)}
$$
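
As a minimal sketch of the formula above, the hypothetical helper `min_max_scale` below (not part of the activity) rescales a short list of illustrative values. The handling of a constant feature, where the denominator would be zero, is an assumption made for this example.

```python
def min_max_scale(values):
    """Rescale a list of numbers to [0, 1] using min-max scaling."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant feature: the formula would divide by zero,
        # so this sketch maps every value to 0.0 (an assumption).
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


scaled = min_max_scale([2.0, 4.0, 10.0])
print(scaled)  # [0.0, 0.25, 1.0]
```

The smallest value maps to 0, the largest to 1, and every other value keeps its relative position between them.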

Standardisation (z-score scaling) transforms features to have zero mean and unit variance, making them suitable for algorithms that are sensitive to feature variance.


$$
\text{mean} = \frac{1}{n} \sum_{i=1}^n x_i
$$

$$
\text{std} = \sqrt{\frac{1}{n} \sum_{i=1}^n (x_i - \text{mean})^2}
$$

$$
z = \frac{x - \text{mean}}{\text{std}}
$$
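
The standard deviation above divides by $n$ (the population form). The sketch below, using illustrative values, shows how this corresponds to `ddof=0` in NumPy and checks that the resulting z-scores have zero mean and unit standard deviation. Note that some libraries default to the sample form instead (pandas `Series.std` uses `ddof=1`), so results can differ slightly if the two are mixed.

```python
import numpy as np

x = np.array([1.3, 5.2, 100.8, 2.7, 3.1])  # illustrative values

mean = x.mean()
std_population = x.std(ddof=0)  # divides by n, as in the formula above
std_sample = x.std(ddof=1)      # divides by n - 1, shown for comparison only

# z-scores computed with the population standard deviation
z = (x - mean) / std_population

print("population std:", std_population)
print("sample std:    ", std_sample)
print("z mean:", z.mean(), "z std:", z.std())
```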

This activity consists of two parts:

- Using a manual approach to compute normalisation and the z-score

- Using built-in functions in `sklearn` to compute normalisation and the z-score

In [None]:
# Import the necessary libraries
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

### Part one: Using a manual approach to compute normalisation and the z-score

In [None]:
# Original data
data = [
    [1.3, 5.1, 3.1],
    [5.2, 3.8, 2.9],
    [100.8, 4.2, 1.4],
    [2.7, 1.1, 4.1],
    [3.1, -123.2, 3.0]
]



In [None]:
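# Manual normalisation
# Transpose the rows so that each tuple holds one feature (column)
features = list(zip(*data))

normalized_features = []
for feature in features:
    min_val = min(feature)
    max_val = max(feature)
    value_range = max_val - min_val
    # Guard against division by zero when a feature is constant
    normalized_col = [(x - min_val) / value_range if value_range else 0.0 for x in feature]
    normalized_features.append(normalized_col)

# Transpose back to the original row-wise format
normalized_data = list(zip(*normalized_features))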

print("Normalized Data (Min-Max Scaling):")
for i, row in enumerate(normalized_data, 1):
    print(f"i={i}\t" + "\t".join(f"{v:.6f}" for v in row))



In [None]:
# Manual z-score standardisation
zscore_features = []
for feature in features:
    mean = sum(feature) / len(feature)
    variance = sum((x - mean) ** 2 for x in feature) / len(feature)
    std_dev = variance ** 0.5
    zscore_col = [(x - mean) / std_dev for x in feature]
    zscore_features.append(zscore_col)

# Transpose it back to the original row-wise format
zscore_data = list(zip(*zscore_features))

print("\nZ-score Standardized Data:")
for i, row in enumerate(zscore_data, 1):
    print(f"i={i}\t" + "\t".join(f"{v:.6f}" for v in row))


### Part two: Using built-in functions in `sklearn` to compute normalisation and the z-score

In `sklearn`, `MinMaxScaler` normalises data to a fixed range, preserving the original distribution's shape but not mitigating the impact of outliers.
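
To illustrate the outlier sensitivity, the sketch below scales a single feature containing one large value (the same numbers as `Feature1` in this activity). The outlier maps to 1 and the remaining values are squeezed into a narrow band near 0.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# One feature with a single large outlier (100.8)
feature = np.array([[1.3], [5.2], [100.8], [2.7], [3.1]])

scaled = MinMaxScaler().fit_transform(feature).ravel()
print(np.round(scaled, 4))

# Largest scaled value among the non-outlier points
second_largest = np.sort(scaled)[-2]
print("Largest non-outlier value after scaling:", round(second_largest, 4))
```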

`StandardScaler` standardises features by removing the mean and scaling to unit variance, so that no feature dominates purely because of its scale. It is particularly useful for algorithms that are sensitive to feature scales, such as gradient-based or distance-based methods. Note that it centres and rescales the data but does not make it normally distributed.

Both scalers are implemented as classes in `sklearn.preprocessing` and can be integrated into machine learning pipelines for efficient preprocessing.
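
As a rough sketch of pipeline integration, the example below chains `StandardScaler` with a classifier using `make_pipeline`. The synthetic data, the labels and the choice of `LogisticRegression` are assumptions made purely for illustration and are not part of this activity.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic two-feature data set (illustrative only)
rng = np.random.default_rng(0)
X = np.column_stack([
    np.linspace(0, 1000, 40),        # feature 1: evenly spaced values
    rng.uniform(0, 1000, size=40),   # feature 2: random values
])
y = (X[:, 0] > 500).astype(int)      # synthetic labels based on feature 1

# The scaler is fitted on the training data inside the pipeline,
# then applied automatically before every prediction.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X, y)

print(model.predict([[450.0, 600.0]]))
```

Wrapping the scaler in a pipeline means the same fitted mean and standard deviation are reused at prediction time, which avoids re-fitting the scaler on new data by mistake.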

In [None]:
# Create the data set as a pandas DataFrame
data = {
    'Feature1': [1.3, 5.2, 100.8, 2.7, 3.1],
    'Feature2': [5.1, 3.8, 4.2, 1.1, -123.2],
    'Feature3': [3.1, 2.9, 1.4, 4.1, 3.0]
}

In [None]:


df = pd.DataFrame(data, index=[1, 2, 3, 4, 5])
print("Original Data:")
print(df)

# --- Normalisation (min-max scaling) ---
scaler_minmax = MinMaxScaler()
normalized_data = scaler_minmax.fit_transform(df)
df_normalized = pd.DataFrame(normalized_data, columns=df.columns, index=df.index)

print("\nNormalized Data (Min-Max Scaling):")
print(df_normalized)

# --- Z-score standardisation ---
scaler_standard = StandardScaler()
zscore_data = scaler_standard.fit_transform(df)
df_zscore = pd.DataFrame(zscore_data, columns=df.columns, index=df.index)

print("\nZ-score Standardized Data:")
print(df_zscore)
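
As an optional check, the sketch below recomputes both scalings with plain NumPy on the same values and compares them with the `sklearn` results. The variable names are chosen for this example only. The two approaches should agree because `StandardScaler` uses the population standard deviation, which matches NumPy's default `ddof=0`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Same values as the data set used in parts one and two
X = np.array([
    [1.3, 5.1, 3.1],
    [5.2, 3.8, 2.9],
    [100.8, 4.2, 1.4],
    [2.7, 1.1, 4.1],
    [3.1, -123.2, 3.0],
])

# Column-wise manual formulas
manual_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
manual_zscore = (X - X.mean(axis=0)) / X.std(axis=0)  # ddof=0 by default

# Compare with the sklearn scalers, allowing for floating-point rounding
minmax_matches = np.allclose(manual_minmax, MinMaxScaler().fit_transform(X))
zscore_matches = np.allclose(manual_zscore, StandardScaler().fit_transform(X))

print("Min-max results match:", minmax_matches)
print("Z-score results match:", zscore_matches)
```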


### To-do:

1. Generate a random 10 by 2 data set with integer values in the range [0, 1000).

2. Initialise the scalers and assign them to `minmax_scaler` and `standard_scaler`, respectively.

3. Apply the scalers, and display the original data and the scaled data.

In [None]:


# Generate a random 10 by 2 data set with values in the range [0, 1000)
np.random.seed(42)  # for reproducibility
data = np.random.randint(0, 1000, size=(10, 2))

print("Original Data:\n", data)

# Initialise the scalers
minmax_scaler = MinMaxScaler()
standard_scaler = StandardScaler()

# Apply MinMaxScaler
data_minmax = minmax_scaler.fit_transform(data)
print("\nMinMax Scaled Data:\n", data_minmax)

# Apply StandardScaler
data_standard = standard_scaler.fit_transform(data)
print("\nStandard Scaled Data:\n", data_standard)


### To-do:

Generate a new data point and compute its scaled values using the scalers fitted above.

In [None]:
# Example new unseen sample (two features)
new_sample = np.array([[450, 600]])

# Transform it using MinMaxScaler fitted on the original data
new_sample_minmax = minmax_scaler.transform(new_sample)
print("New sample after MinMax scaling:", new_sample_minmax)

# Transform it using StandardScaler fitted on the original data
new_sample_standard = standard_scaler.transform(new_sample)
print("New sample after Standard scaling:", new_sample_standard)
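
A fitted scaler reuses the statistics it learned during `fit`, so a new sample that lies outside the fitted range is not forced into [0, 1]. The sketch below regenerates the same random data and transforms a hypothetical out-of-range sample (the values 1500 and -20 are chosen only for illustration), then recovers the original values with `inverse_transform`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Recreate the data set from the to-do above
np.random.seed(42)
data = np.random.randint(0, 1000, size=(10, 2))

scaler = MinMaxScaler().fit(data)

# Hypothetical sample lying outside the range seen during fitting
outside = np.array([[1500, -20]])
scaled = scaler.transform(outside)
print("Scaled out-of-range sample:", scaled)

# inverse_transform maps scaled values back to the original units
restored = scaler.inverse_transform(scaled)
print("Restored sample:", restored)
```

The first scaled value is greater than 1 and the second is below 0, because the sample exceeds the minimum and maximum learned from the original data.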
