# PL02. Normalization and Standardization of Palmer Penguins Dataset

## Performed by Víctor Vega Sobral
__Borja González Seoane. Machine Learning. Course 2024-25__

For this exercise, as in PL01, the Palmer Penguins dataset will be used again. You will find the CSV `palmer_penguins.csv` in the Virtual Campus. This dataset contains information about penguins of several species.

In this exercise, drawing on the EDA from PL01, you should work with Scikit-Learn to normalize and standardize the numerical columns of the dataset. Visualizations will likely help in observing how the applied transformations change the data.


In [23]:
import matplotlib.pyplot as plt
import pandas as pd
import plotly.express as px
import seaborn as sns

from sklearn.preprocessing import StandardScaler, MinMaxScaler

## Dataset loading

In [24]:
CSV_FILE = "palmer_penguins.csv"

df = pd.read_csv(CSV_FILE)

## Normalization

In [None]:
# Selection of numerical variables
numeric_columns = ["bill_depth_mm","flipper_length_mm","body_mass_g"]

In [None]:
# Check that the columns are not yet normalized
for column in numeric_columns:
    print("CURRENT COLUMN:", column)
    print(f"Maximum: {df[column].max()}")
    print(f"Mean: {df[column].mean()}")
    print(f"Minimum: {df[column].min()}")

CURRENT COLUMN: bill_depth_mm
Maximum: 21.5
Mean: 17.151169590643274
Minimum: 13.1
CURRENT COLUMN: flipper_length_mm
Maximum: 231.0
Mean: 200.91520467836258
Minimum: 172.0
CURRENT COLUMN: body_mass_g
Maximum: 6300.0
Mean: 4201.754385964912
Minimum: 2700.0


In [None]:
# Scaler initialization
scaler = MinMaxScaler()

for column in numeric_columns:
    if column in df.columns:
        print(f"The current column is: {column.upper()}")
        # Normalize the column to [0, 1]. The scaler is refit on every pass,
        # so afterwards it only holds the parameters of the last column.
        norm_column = f"{column}_norm_sklearn"
        df[norm_column] = scaler.fit_transform(df[[column]])
        print("NORMALIZED VALUES:")
        print(f"Maximum: {df[norm_column].max()}")
        print(f"Mean: {df[norm_column].mean()}")
        print(f"Minimum: {df[norm_column].min()}")

The current column is: BILL_DEPTH_MM
NORMALIZED VALUES:
Maximum: 1.0
Mean: 0.48228209412419953
Minimum: 0.0
The current column is: FLIPPER_LENGTH_MM
NORMALIZED VALUES:
Maximum: 1.0
Mean: 0.4900882148875014
Minimum: 0.0
The current column is: BODY_MASS_G
NORMALIZED VALUES:
Maximum: 1.0
Mean: 0.4171539961013645
Minimum: 0.0
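Min-max scaling is the linear map `(x - min) / (max - min)`, so the mean of a column passes through the same formula. That gives a quick sanity check on the normalized means. The sketch below uses only the raw statistics printed earlier in this notebook; `printed_stats` and `min_max` are illustrative names, not part of the original cells.

```python
import math

# Raw statistics copied from the notebook output: (minimum, mean, maximum)
printed_stats = {
    "bill_depth_mm": (13.1, 17.151169590643274, 21.5),
    "flipper_length_mm": (172.0, 200.91520467836258, 231.0),
    "body_mass_g": (2700.0, 4201.754385964912, 6300.0),
}


def min_max(value, low, high):
    """Map a value from the range [low, high] to [0, 1]."""
    return (value - low) / (high - low)


for name, (low, mean, high) in printed_stats.items():
    # The scaled mean should agree with the mean of the *_norm_sklearn column
    print(name, min_max(mean, low, high))
```

The three printed values should agree with the normalized means reported by the loop, up to floating-point rounding.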


In [28]:
# The scaler was refit for each column in the loop above, so it only keeps
# the parameters of the last one (body_mass_g)
print(f"Minimum stored in the scaler object: {scaler.data_min_}")
print(f"Maximum stored in the scaler object: {scaler.data_max_}")

Minimum stored in the scaler object: [2700.]
Maximum stored in the scaler object: [6300.]
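The scaler reports a single minimum and maximum because it was refit on one column at a time. A possible alternative, sketched here on a small made-up DataFrame (`toy`, `multi_scaler` and `toy_norm` are hypothetical names, and the values are not dataset rows), is to fit one `MinMaxScaler` on all numeric columns at once, so that `data_min_` and `data_max_` hold one entry per column.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy values standing in for the penguin measurements
toy = pd.DataFrame(
    {
        "bill_depth_mm": [13.1, 17.0, 21.5],
        "flipper_length_mm": [172.0, 200.0, 231.0],
        "body_mass_g": [2700.0, 4200.0, 6300.0],
    }
)

# A single fit learns the minimum and maximum of every column
multi_scaler = MinMaxScaler()
toy_norm = pd.DataFrame(
    multi_scaler.fit_transform(toy),
    columns=toy.columns,
    index=toy.index,  # keep the original index so rows stay aligned
)

print(multi_scaler.data_min_)
print(multi_scaler.data_max_)
print(toy_norm)
```

Keeping one fitted scaler for all columns also makes it possible to reuse it later, for example with `transform` on new data or `inverse_transform` to recover the original units.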


## Standardization

In [None]:
# Function to standardize directly
# with the formula from the slides
def standardize(series):
    # pandas computes the sample standard deviation (ddof=1) by default
    return (series - series.mean()) / series.std()

for column in numeric_columns:
    df[f"{column}_standard_formula"] = standardize(df[column])
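The formula-based version relies on pandas semantics: `mean()` and `std()` skip missing values, so a missing measurement stays missing in the result instead of breaking the computation. A minimal sketch on a made-up column (`toy` and `toy_std` are hypothetical names and values):

```python
import pandas as pd


def standardize(series):
    # pandas skips NaN in mean() and std(); std() uses ddof=1 by default
    return (series - series.mean()) / series.std()


# Hypothetical toy column with one missing measurement
toy = pd.Series([2700.0, 3500.0, None, 4200.0, 6300.0], name="body_mass_g")
toy_std = standardize(toy)

print(toy_std)
print("Missing values kept:", int(toy_std.isna().sum()))
print("Mean:", toy_std.mean())
print("Standard deviation:", toy_std.std())
```

Measured with the same pandas `std()`, the standardized column has a standard deviation of 1 up to floating-point rounding.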

### Z-score standardization


In [None]:
import pandas as pd
from sklearn.preprocessing import StandardScaler

numeric_columns = ["bill_depth_mm", "flipper_length_mm", "body_mass_g"]
# Create a StandardScaler instance
scaler = StandardScaler()
# Fit the scaler to the numeric columns and transform them
df_scaled = scaler.fit_transform(df[numeric_columns])
# Convert the result back to a DataFrame
df_scaled_df = pd.DataFrame(df_scaled, columns=numeric_columns)
# Replace the original columns with the standardized ones in the original
# DataFrame (the assignment aligns rows on the index)
df[numeric_columns] = df_scaled_df
# Print statistics of df. The columns have already been replaced at this
# point, so these are no longer the statistics of the raw data.
print("Statistics of df after replacing the columns:")
print("Means:\n", df[numeric_columns].mean())
print("Standard deviations:\n", df[numeric_columns].std())
# Print statistics of the standardized DataFrame. pandas .std() uses ddof=1
# while StandardScaler uses ddof=0, so the values are close to 1, not exactly 1.
print("\nStatistics of the standardized DataFrame:")
print("Means:\n", df_scaled_df.mean())
print("Standard deviations:\n", df_scaled_df.std())

Statistics of df after replacing the columns:
Means:
 bill_depth_mm       -0.001041
flipper_length_mm   -0.002759
body_mass_g         -0.002925
dtype: float64
Standard deviations:
 bill_depth_mm        1.001872
flipper_length_mm    1.000626
body_mass_g          0.997604
dtype: float64

Statistics of the standardized DataFrame:
Means:
 bill_depth_mm        4.155221e-16
flipper_length_mm   -8.310441e-16
body_mass_g          8.310441e-17
dtype: float64
Standard deviations:
 bill_depth_mm        1.001465
flipper_length_mm    1.001465
body_mass_g          1.001465
dtype: float64
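The standard deviations of 1.001465 rather than exactly 1 come from two different conventions: `StandardScaler` divides by the population standard deviation (ddof=0), while pandas `std()` reports the sample standard deviation (ddof=1). The ratio between the two is `sqrt(n / (n - 1))`. The sketch below reproduces the effect with the standard library on a made-up list; the value `n = 342` is an assumption about the number of non-missing rows, chosen because it is consistent with the printed figure.

```python
import math
import statistics

# Hypothetical toy sample; any non-constant list shows the same effect
values = [2700.0, 3500.0, 4200.0, 5000.0, 6300.0]
n = len(values)

mean = statistics.fmean(values)
pop_std = statistics.pstdev(values)  # ddof=0, the convention of StandardScaler
z = [(v - mean) / pop_std for v in values]

sample_std_of_z = statistics.stdev(z)  # ddof=1, the convention of pandas .std()
print(sample_std_of_z, math.sqrt(n / (n - 1)))

# Same ratio for an assumed n = 342 non-missing rows
print(round(math.sqrt(342 / 341), 6))
```

Passing `ddof=0` to the pandas `std()` call would be one way to make both conventions agree when checking the result.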


## Conclusions


...

Normalization and standardization are key preprocessing techniques to apply before training machine learning algorithms. By putting the features on a comparable scale, they can improve the performance and accuracy of scale-sensitive models, such as distance-based or gradient-based methods, where a feature with a large numeric range like `body_mass_g` would otherwise dominate.
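To illustrate why a comparable scale matters, the following sketch compares Euclidean distances between three made-up penguins before and after min-max scaling. The ranges are the minima and maxima printed earlier in this notebook; the penguins `a`, `b`, `c` and the helpers `scale` and `distance` are hypothetical and only serve the example.

```python
import math

# (minimum, maximum) per column, taken from the notebook output
ranges = {
    "bill_depth_mm": (13.1, 21.5),
    "flipper_length_mm": (172.0, 231.0),
    "body_mass_g": (2700.0, 6300.0),
}

# Three hypothetical penguins (illustrative values, not rows of the dataset)
a = {"bill_depth_mm": 18.0, "flipper_length_mm": 190.0, "body_mass_g": 3500.0}
b = {"bill_depth_mm": 18.5, "flipper_length_mm": 195.0, "body_mass_g": 4500.0}
c = {"bill_depth_mm": 14.0, "flipper_length_mm": 225.0, "body_mass_g": 3600.0}


def scale(penguin):
    """Apply min-max scaling to every feature of one penguin."""
    return {
        key: (value - ranges[key][0]) / (ranges[key][1] - ranges[key][0])
        for key, value in penguin.items()
    }


def distance(p, q):
    """Euclidean distance over the three numeric features."""
    return math.sqrt(sum((p[key] - q[key]) ** 2 for key in ranges))


raw_ab, raw_ac = distance(a, b), distance(a, c)
scaled_ab = distance(scale(a), scale(b))
scaled_ac = distance(scale(a), scale(c))

print("Raw:    d(a, b) =", raw_ab, " d(a, c) =", raw_ac)
print("Scaled: d(a, b) =", scaled_ab, " d(a, c) =", scaled_ac)
```

On the raw values, `b` looks much farther from `a` than `c` does, purely because of the body mass difference in grams. After scaling, the ordering flips, since `c` differs from `a` far more in bill depth and flipper length.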

Keeping the numerical columns in a list and iterating over it lets us normalize and standardize all of them with far fewer lines of code.